[The material below is excerpted from Software Synthesizers, published in 2003 by Backbeat Books.]

How can they put a synthesizer in a computer?

Believe it or not, computers are not magic. (Nor, occasional appearances to the contrary notwithstanding, are they malevolent entities out to destroy your happiness and turn you into a quivering wreck.) Everything that happens in a computer ultimately comes down to strings of numbers. The computer appears to be doing other things besides numbers because it handles the numbers very, very, very quickly, and because it's attached to peripheral devices that translate the numbers into a form we humans can more readily deal with -- letters of the alphabet, for instance, or pictures, or sounds.

A synthesizer is no different. In order for a sound to exist inside the computer, it has to be described as a string of numbers. This is not the place for an in-depth discussion of digital audio, but in order to feel at home in the world of software synthesis you'll need to wrap your brains around a few basics, so let's take a closer look before we move on.

Let's start at the very beginning; that's a very good place to start. You probably already knew this, but we live at the bottom of a thick blanket of air. This blanket, which is called the atmosphere, extends many miles upward from the surface of the Earth, growing thinner mile by mile, until it merges with the vacuum of space. If it weren't for gravity, the atmosphere would have dissipated into space billions of years ago, and life never would have emerged on Earth. Gravity is what holds the atmosphere in place -- and while air may seem pretty flimsy stuff, like everything else that's subject to gravity, it has a measurable weight. What's more, there's a lot of it over our heads. At sea level, there's so much air pressing down on us that it exerts a pressure of about 14.7 pounds per square inch. (That's over a ton per square foot.)
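The "over a ton" figure is simple unit arithmetic: a square foot contains 12 × 12 = 144 square inches, and a U.S. (short) ton is 2,000 pounds.

$$
14.7\ \frac{\text{lb}}{\text{in}^2} \times 144\ \frac{\text{in}^2}{\text{ft}^2} = 2{,}116.8\ \frac{\text{lb}}{\text{ft}^2} > 2{,}000\ \text{lb}
$$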
Except when the wind is blowing, we don't usually notice air pressure, but it's always there. The reason it doesn't squash us all flat is that our innards are pushing outward against our skin with exactly the same amount of pressure. We notice it mainly when driving up or down a mountain -- those awkward moments when our ears pop because the pressure inside is no longer the same as the pressure outside.

Sound consists of rapid (and relatively small) fluctuations in air pressure. These fluctuations originate at some point -- let's say, on the stretched skin of a drumhead when a conga player smacks it with his or her hand -- and spread out in all directions through the air. When first struck, the drumhead moves downward, creating a low-pressure zone just above its upper surface. As air molecules rush in from the surrounding area to fill the low-pressure zone, the pressure in the surrounding area drops. In effect, the low-pressure zone propagates outward in all directions.

A moment later, the drumhead rebounds upward, pushing the air molecules together and creating a zone of higher-than-normal pressure. Now the same thing happens again, in reverse. The molecules in the high-pressure zone spread out into the surrounding area, which increases the pressure there. In effect, the high-pressure zone propagates outward, following the low-pressure zone. As the drumhead wobbles in and out, the process continues. Until the drumhead comes to rest, the air pressure at the surface keeps changing. That's how sound is created.

If you spend any time reading about digital audio, before long you'll run into a diagram that looks like Figure 1-2, which shows roughly what sound would look like if we could see it. The zones of higher and lower pressure are not stationary. They travel through the air, radiating in all directions from the sound source, at about 1,100 feet per second at sea level.
When they reach a human ear, the ear senses them and reports to the brain on the exact shape and intensity of the pressure fluctuations. That's how we hear the conga drum, or anything else.

We can replace the ear with a scientific instrument equipped to track changes in air pressure -- in other words, a microphone. A microphone is a type of transducer; that is, it transforms energy from one form to another. In this case, the physical energy of changes in air pressure is transformed into a fluctuating electrical voltage. The voltage fluctuations -- which again would look very much like Figure 1-2 if we could see them -- flow down the cable attached to the microphone.

It's essential to understand that the pattern of voltage fluctuations is virtually identical in shape to the pattern of changes in air pressure. Each increase in air pressure produces a corresponding increase in voltage, and each decrease produces a corresponding decrease. I said "virtually identical" because there will always be subtle differences, which depend on the engineering limitations of the microphone we're using. But assuming the microphone is reasonably good, we can ignore the differences for purposes of discussion. The signal coming from the microphone is called an analog electrical signal (or "analogue" if you're in Britain), because the pattern of changes in voltage is directly analogous to the pattern of changes in air pressure.

If we plug the mic cable into an amplifier and plug the amp into a speaker, we have our old friend the public address (P.A.) system. The back-and-forth motions of the speaker (another transducer) will produce fluctuations in air pressure that closely resemble the original fluctuations, only louder. If we plug the mic into a tape deck, we can record the electrical signal and play it back through a P.A. months or years from now.

As interesting as this process is from a technical standpoint, this isn't a book about mics, or speakers, or tape decks. It's about software.
So we're going to do something more apropos: We're going to plug the mic into a computer. (Hah! You thought we'd never get here.) Trouble is, the computer can't understand or make use of the voltage fluctuations coming down the wire. Computers only understand numbers. In order to get the mic's signal into the computer, we'll have to transform it into a string of numbers.

This neat trick is accomplished with a device called an analog-to-digital converter (ADC or A/D for short). The A/D converter, which is typically built into a soundcard, measures the incoming voltage over and over and stores each measurement as a number. It sends the most recent number down the line to the computer, takes another measurement, sends that one to the computer, and so on.

Once inside the computer, the stream of numbers representing the sound can be processed in an almost infinite variety of ways. In order to listen to it, though, we'll have to send the numbers back to the soundcard. Here, the process is run in reverse: The soundcard translates the numbers into a continuously varying electrical voltage using a digital-to-analog converter (DAC or D/A). The voltage is then sent to an amp and speakers, and we hear it as sound.

If all goes well, and if the person operating the computer software hasn't been too creative about processing the numbers, we'll recognize the sound coming from the computer as being essentially identical to the sound -- be it a conga drum, a human voice, or an entire orchestra concert -- that first entered the microphone. But the results are not guaranteed. Any number of problems can get in the way, causing the sound to be distorted and mangled -- perhaps subtly, perhaps so radically that it's rendered unrecognizable. To ensure that the computer produces the desired sounds, the ADC and DAC (to say nothing of the mic and speakers) have to represent the sound waves accurately.
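To make the measure-and-store loop concrete, here is a minimal Python sketch of an A/D and D/A round trip. It is an illustration under stated assumptions, not how any real soundcard works: `analog_voltage()` is a hypothetical stand-in for the mic signal (a 440Hz tone peaking at 1 volt), the converter is assumed to be 16-bit, and the scaling convention (multiply by 32,767) is one common choice among several. A real DAC also smooths the voltage between samples, which this sketch leaves out.

```python
import math

def analog_voltage(t: float) -> float:
    """Hypothetical stand-in for the mic signal: a 440 Hz tone peaking at 1.0 volt."""
    return math.sin(2 * math.pi * 440.0 * t)

def adc(signal, sample_rate: int, seconds: float, bits: int = 16) -> list[int]:
    """Measure the voltage over and over, storing each measurement as a whole number."""
    full_scale = 2 ** (bits - 1) - 1           # largest positive value: 32,767 for 16 bits
    count = round(sample_rate * seconds)       # how many measurements to take
    return [round(signal(n / sample_rate) * full_scale) for n in range(count)]

def dac(samples: list[int], bits: int = 16) -> list[float]:
    """Run the process in reverse: turn each number back into a voltage level."""
    full_scale = 2 ** (bits - 1) - 1
    return [s / full_scale for s in samples]

samples = adc(analog_voltage, sample_rate=44_100, seconds=0.01)
voltages = dac(samples)
print(len(samples))       # 441 measurements in 10 milliseconds
print(samples[:4])        # the first few numbers the computer actually receives
```

Each voltage coming back out of `dac()` differs from the original by at most half of one step on the 16-bit scale, which is the "virtually identical" round trip the text describes.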
The key question, then, is this: How accurate does the computer representation of a sound have to be in order for human listeners to find it acceptable or even enjoyable?

Now we're ready to talk specs. The most important factors in producing good-quality digital audio are bit resolution and sampling rate. Both terms refer to the accuracy with which the audio is represented in the form of numbers.

You probably know that a movie or a TV picture doesn't actually consist of moving images. It consists of a sequence of still photos. The photos are projected on the screen one by one, but because they succeed one another so rapidly, our brains blend them together into the illusion of a single moving image. A similar process is used to represent a continuous stream of audio as a stream of discrete numbers.

A typical movie runs at a rate of 24 or 25 images (called "frames") per second. But the ear is a lot more discriminating than the eye. In order to create a good-sounding digital representation of a sound, we have to take "snapshots" of the fluctuating voltage at least 40,000 times per second. The reason: A digital system can only represent frequencies up to half its sampling rate, and human hearing extends up to roughly 20,000 cycles per second (20kHz). Each snapshot is referred to as a sample or sample word. (The word "sample" has two separate but related meanings, as explained in Chapter 5. In the discussion below, it's used exclusively to refer to a single number, not to a complete digital sound recording.) The rate at which samples are taken is known as the sampling rate.

The sampling rate used in music CDs is 44,100 sample words per second. This rate is used by many digital audio programs, including softsynths. These days it's a minimum standard: Many programs run at higher rates, such as 48,000, 96,000, or even 192,000 samples per second. Some older soundcards offer you the option of running at a lower sampling rate, such as 22,050 or even 11,025 samples per second. With a lower sampling rate, the fidelity of the sound will be degraded, because fewer of the high frequencies can be captured.

Forget about the microphone for a minute.
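The half-the-sampling-rate rule explains both the 40,000 figure and why the lower rates sound duller. This short Python sketch simply divides each rate mentioned above by two and compares the result with an assumed 20kHz upper limit of hearing (a round number; individual hearing varies, and it declines with age).

```python
# Highest frequency each sampling rate can represent: half the rate.
HEARING_LIMIT_HZ = 20_000  # assumed round figure for the upper limit of human hearing

def nyquist(sample_rate: int) -> float:
    """Return the highest frequency (in Hz) a given sampling rate can represent."""
    return sample_rate / 2

for rate in (11_025, 22_050, 44_100, 48_000, 96_000, 192_000):
    top = nyquist(rate)
    verdict = "covers" if top >= HEARING_LIMIT_HZ else "falls short of"
    print(f"{rate:>7,} samples/s -> up to {top:>8,.0f} Hz, {verdict} the assumed hearing limit")

# Minimum rate for full-range audio: twice the highest frequency we want to keep.
print(2 * HEARING_LIMIT_HZ)   # 40000
```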
A softsynth generates its tones from scratch, as strings of numbers, and sends the numbers to the DAC so we can listen to the results. Every second, then, a softsynth has to generate 44,100 discrete sample words (if not more). Oh, and that's per note. Play a five-note chord, and the poor softsynth has to generate and process 220,500 samples every second.

If that sounds like a lot of number-crunching to you, you're right. That's why computer-based software synths have become a realistic possibility only during the past five years or so. Until computer chips got up into the 200MHz range, software-based synthesis just wasn't practical -- at least not as a real-time proposition, and not if the goal was musically pleasing sounds. On a slower computer, a softsynth can render its audio output to a disk file, in which case it can take as long as it needs to crunch the numbers. But while rendering is a powerful technique that works fine even on a slow computer, you can't play a renderer from a keyboard and hear the music as you play. Being able to do that is what "real-time" means.

Let's go back to what happens at the ADC, when the signal from the mic is first being turned into numbers. We're measuring the signal 44,100 times per second -- but how accurate are those individual measurements?

When you're measuring how tall your children are, you probably use a yardstick. The yardstick is most likely marked off in 16ths of an inch. In the backwoods USA, that is. In most of the modern world, it's a meter stick, not a yardstick, and it's marked off in millimeters, but we'll go with the yardstick. If your yardstick were marked off only in feet, with no marks in between, you'd have to record your children as all being two feet tall, three feet tall, four feet tall, and so on. A child whose actual height was between three and four feet would have to be recorded as being either three feet or four feet tall, because your measuring system would provide no information more accurate than that.
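Here is a minimal Python sketch of a softsynth generating a five-note chord from scratch, one sample at a time, and counting how many per-note samples it had to compute for one second of sound. The plain sine oscillators and the chord voicing are assumptions chosen for simplicity; real softsynths use richer oscillators, filters, and envelopes, which only add to the workload. The last lines compute the real-time budget: at 44,100 samples per second, each output sample has to be ready in about 22.7 microseconds.

```python
import math

SAMPLE_RATE = 44_100  # sample words per second, per note

def render_chord(freqs, seconds=1.0, sample_rate=SAMPLE_RATE):
    """Generate a chord as a string of numbers using simple sine oscillators.

    Returns the mixed output and the number of per-note samples computed.
    """
    num_frames = round(seconds * sample_rate)
    mix = [0.0] * num_frames
    computed = 0
    for f in freqs:                        # one oscillator per note
        for n in range(num_frames):
            # Scale each note down so the five-note mix stays within -1.0..1.0.
            mix[n] += math.sin(2 * math.pi * f * n / sample_rate) / len(freqs)
            computed += 1
    return mix, computed

# C major chord spanning two octaves (frequencies in Hz, rounded)
chord = [261.63, 329.63, 392.00, 523.25, 659.26]
mix, computed = render_chord(chord)
print(computed)                            # 220500 samples for one second of a five-note chord

# Real-time budget: each output sample must be ready within 1/44,100 of a second.
budget_us = 1_000_000 / SAMPLE_RATE
print(f"{budget_us:.1f} microseconds per output sample, shared by all {len(chord)} notes")
```

If a loop like this takes longer than one second to produce one second of audio, it can still render to a file, but it can't keep up in real time.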
Being human, you're a lot smarter than a computer, so if you were using such a stupid yardstick you'd probably record Suzy's height as "a little more than three feet" or "not quite four feet." But a computer can't do that. For a computer, those in-between measurements don't exist. The computer can only record whole, exact numbers. So it needs to use a yardstick that's as accurate as possible -- a yardstick marked off into a lot of tiny increments.

The yardstick for measuring sound is described in terms of the number of bits that can be used to store each sample word. The more bits, the more accurate the measurement: Each additional bit doubles the number of increments. It turns out that eight bits are just about the minimum you need to represent sound acceptably. With an 8-bit ADC, the sound "yardstick" is marked off with 256 small increments (2 to the 8th power). The first-generation samplers I mentioned earlier, the Emulator and the Mirage, recorded and played back sound as streams of eight-bit numbers (bytes, in other words). Eight-bit sound is noticeably harsh and grainy, because the measurements of the sound pressure level are noticeably inaccurate. When inaccuracy creeps into the system, we perceive it as added noise.

Sound is stored on music CDs as 16-bit numbers. Sixteen-bit audio has a much cleaner sound (less inherent noise), because the sound waves can be represented much more accurately: The 16-bit "yardstick" is marked off into 65,536 tiny increments.

But why stop there? If 16-bit sound is good, why not use 24-bit sound, or 32-bit sound, or 64-bit? Modern digital audio software, running on a fast computer, often uses 24-bit or 32-bit numbers to represent sound waves. But the computer has to work harder to process larger numbers.
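The yardstick idea can be tried out directly. This Python sketch snaps each point of a sine wave to the nearest mark on a yardstick with 2 to the nth power marks and reports the largest measuring error, which is the inaccuracy we hear as noise. It is a simplified model under an assumption worth stating: the marks are spread evenly from -1.0 to +1.0, whereas real converters use signed integers (for 16 bits, -32,768 to 32,767), so the exact step size differs slightly.

```python
import math

def quantize(x: float, bits: int) -> float:
    """Snap a value in [-1.0, 1.0] to the nearest mark on a yardstick with 2**bits marks."""
    levels = 2 ** bits                 # 256 marks for 8 bits, 65,536 for 16 bits
    step = 2.0 / (levels - 1)          # distance between adjacent marks
    index = round((x + 1.0) / step)    # which mark is closest
    return index * step - 1.0

def worst_error(bits: int, points: int = 10_000) -> float:
    """Largest measuring error over one cycle of a full-scale sine wave."""
    worst = 0.0
    for n in range(points):
        x = math.sin(2 * math.pi * n / points)
        worst = max(worst, abs(quantize(x, bits) - x))
    return worst

for bits in (8, 16, 24):
    print(f"{bits}-bit: {2 ** bits:,} marks, worst error {worst_error(bits):.2e}")
```

The error never exceeds half the distance between marks, so every extra bit cuts the worst-case error in half.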
When the computer is forced to work too hard, one of two things happens. Either the softsynth says, "Sorry, I can only play seven notes at a time," and flatly refuses to add any new notes until it finishes with one of the notes it's already playing, or the audio output abruptly fills up with very ugly pops, clicks, and stuttering noises. The audio output might even shut down entirely. When the audio engine in your computer stutters or chokes because it can't spit out enough numbers quickly enough, we say you're hearing dropouts.

Asking a softsynth to play too many notes at once is only one possible source of audio dropouts; there are others. On a PC, for instance, your soundcard may be sharing an IRQ (interrupt request) with too many other devices. To prevent dropouts, you may need to move the soundcard physically to a different slot in the computer. (This operation requires some care, however. If you're encountering dropouts, don't just start fooling around in the guts of the machine. Phone your soundcard manufacturer's technical support hotline and ask for their help.)

Each time the software developer improves the audio quality by boosting the sampling rate or bit resolution, the audio software can accomplish less before it uses up all of the available bandwidth in the CPU. Doubling the sampling rate from 48,000 to 96,000, for example, doubles the number of samples the software must compute every second. "Bandwidth," in this case, refers to how many distinct arithmetic operations the CPU can perform per second. Roughly speaking, an 800MHz chip can perform twice as many operations per second as a 400MHz chip of the same design. Another term that's often used to describe CPU bandwidth is "machine cycles." The more machine cycles a softsynth uses per second, the fewer cycles are left over for the computer to do anything else, such as redraw the screen.
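The "Sorry, I can only play seven notes at a time" behavior can be sketched as a polyphony limit. `VoiceAllocator` below is a hypothetical Python class written for this illustration, not code from any particular softsynth; it simply refuses new notes once every voice is busy. Many real synths instead "steal" a voice (for example, the oldest note) rather than refusing, which is a different design choice.

```python
class VoiceAllocator:
    """Hypothetical polyphony limiter: refuses new notes once every voice is busy."""

    def __init__(self, max_voices: int):
        self.max_voices = max_voices
        self.active: set[int] = set()      # MIDI note numbers currently sounding

    def note_on(self, note: int) -> bool:
        """Try to start a note; return False if no voice is free."""
        if note in self.active:
            return True                     # already sounding; nothing new to compute
        if len(self.active) >= self.max_voices:
            return False                    # "Sorry, I can only play seven notes at a time."
        self.active.add(note)
        return True

    def note_off(self, note: int) -> None:
        """Release a note, freeing its voice for the next one."""
        self.active.discard(note)

synth = VoiceAllocator(max_voices=7)
results = [synth.note_on(n) for n in range(60, 69)]   # try to play nine notes
print(results)              # first seven accepted, last two refused
synth.note_off(60)
print(synth.note_on(68))    # True: a voice was freed
```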
Sooner or later, we reach a point of diminishing returns: Improving the audio quality further isn't useful, because the difference to human ears will be very, very subtle, while the degradation in computer performance caused by the amount of arithmetic the software has to execute in real time becomes overwhelming.

If the sampling rate is too low, the high frequencies in the sound will get lost. If the bit resolution (also called word length, because each sample is stored as an 8-bit, 16-bit, or 24-bit numerical "word") is too low, the sound will be noisy. That's pretty much all you need to know. Your softsynth almost certainly supports at least a 16-bit, 44.1kHz data stream, so if you're hearing a poor-quality signal, the source of your problems probably lies elsewhere.

Other forms of digital audio nastiness include:

Clipping. There's an absolute limit on how large the numbers in a digital audio system can be. (With floating-point math, this isn't precisely true, but the floating-point numbers will have to be converted back to integers before being sent to your soundcard, so clipping can still become a problem.) If your audio software tries to make a number that's too big, the waveform will reach the maximum possible level and then "clip," flattening off at that level. In an audio editing program, clipping looks like Figure 1-3. If it's brief, clipping sounds like a pop or click. If it goes on for more than a few milliseconds, it sounds as if the audio is being mangled with a buzz saw.

Aliasing. If the softsynth tries to produce frequencies higher than half the sampling rate, they can't be represented correctly. Instead, they "fold back" and show up as new overtones at lower frequencies that were never part of the intended sound. A detailed discussion of aliasing would take several pages and several diagrams. Suffice it to say that if a high-pitched sustained tone sounds bell-like when you don't expect it to, or if a tone with vibrato has an unexpected up-and-down whooshing quality, you've got aliasing.
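Both problems can be sketched in a few lines of Python. The first half converts floating-point samples to 16-bit integers, assuming the common convention that full scale is -1.0 to 1.0 (other conventions exist); a mix that peaks at 1.5 has to be flattened at the limits, which is clipping. The second half uses the standard folding formula for an ideal sampler to show where the harmonics of a sawtooth land when they exceed half the sampling rate. The 3,520Hz pitch is an arbitrary choice for illustration.

```python
import math

INT16_MIN, INT16_MAX = -32_768, 32_767

def to_int16(x: float) -> int:
    """Convert a floating-point sample (nominal range -1.0..1.0) to a 16-bit integer.

    Anything beyond full scale is flattened at the limit: that is clipping.
    """
    value = round(x * INT16_MAX)
    return max(INT16_MIN, min(INT16_MAX, value))

# Floating-point math happily holds a mix that peaks at 1.5 ...
mixed = [1.5 * math.sin(2 * math.pi * n / 32) for n in range(32)]
pcm = [to_int16(x) for x in mixed]
clipped = sum(1 for s in pcm if s in (INT16_MIN, INT16_MAX))
print(f"{clipped} of {len(pcm)} samples clipped")   # 18 of 32 samples clipped

def alias_frequency(freq: float, sample_rate: float) -> float:
    """Frequency actually produced when a component above half the sampling rate folds back."""
    folded = freq % sample_rate
    return sample_rate - folded if folded > sample_rate / 2 else folded

# Harmonics of a 3,520 Hz sawtooth at 44,100 samples per second
fs = 44_100
for k in range(1, 11):
    f = 3_520 * k
    print(f"harmonic {k:2}: {f:>6} Hz -> heard at {alias_frequency(f, fs):>8.0f} Hz")
```

From the 7th harmonic on, the components fold back to 19,460Hz, 15,940Hz, 12,420Hz, and 8,900Hz. None of these is a whole-number multiple of 3,520Hz, which is why aliasing tends to sound clangorous or bell-like.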
The usual solution is to lower the cutoff frequency of the synth's filter, or choose a waveform that has fewer overtones (such as a triangle wave instead of a sawtooth).

Floating -- what's the point?

Music software companies sometimes toss the term "floating point" into their technical specs. In theory -- and often in practice -- digital audio that uses floating-point calculations sounds better. Without getting too technical, since this isn't a book on computer programming, let's just say that with floating-point math, it's a bit easier to handle large numbers, especially when you don't know in advance how large they're going to be. The alternative to floating-point is fixed-point. Whether, or in what circumstances, 64-bit fixed-point audio sounds superior to 24-bit floating-point audio is a bit like the question of how many angels can dance on the head of a pin, so we'll leave it for the medieval theologians in the crowd to ponder.

Do I need to know programming?

Here and there in this book, you'll read references to synthesizer programming and synthesizer programs. As used in this context, the word "program" has very little to do with computer programs and computer programming. A synthesizer programmer is a person who uses the controls provided in the synth to create new sounds. These sounds are then stored in the synth's memory, at which point they're called programs, presets, or patches. (These terms are pretty much interchangeable.) You can be a synth programmer without knowing a solitary thing about computer programming: Familiarity with computer code is not required. For that matter, you can make great music with software synthesizers by using the sound programs provided by the manufacturer. You don't need to program your own sounds.
(c) 2003 by United Entertainment Media.
All rights reserved.