Let’s look at the other speech organs. Pay attention to your lips when you alternate between the vowels in boot and book. For boot, you round the lips and protrude them. This adds an air chamber, with its own resonances, to the front of the vocal tract, amplifying and filtering other sets of frequencies and thus defining other vowel contrasts. Because of the acoustic effects of the lips, when we talk to a happy person over the phone, we can literally hear the smile.
Remember your grade-school teacher telling you that the vowel sounds in bat, bet, bit, bottle, and butt were “short,” and the vowel sounds in bait, beet, bite, boat, and boot were “long”? And you didn’t know what she was talking about? Well, forget it; her information is five hundred years out of date. Older stages of English differentiated words by whether their vowels were pronounced quickly or were drawn out, a bit like the modern distinction between bad meaning “bad” and baaaad meaning “good.” But in the fifteenth century English pronunciation underwent a convulsion called the Great Vowel Shift. The vowels that had simply been pronounced longer now became “tense”: by advancing the tongue root (the muscles attaching the tongue to the jaw), the tongue becomes tense and humped rather than lax and flat, and the hump narrows the air chamber in the mouth above it, changing the resonances. Also, some tense vowels in modern English, like in bite and brow, are “diphthongs,” two vowels pronounced in quick succession as if they were one: ba-eet, bra-oh.
You can hear the effects of the fifth speech organ by drawing out the vowel in Sam and sat, postponing the final consonant indefinitely. In most dialects of English, the vowels will be different: the vowel in Sam will have a twangy, nasal sound. That is because the soft palate or velum (the fleshy flap at the back of the hard palate) is opened, allowing air to flow out through the nose as well as through the mouth. The nose is another resonant chamber, and when vibrating air flows through it, yet another set of frequencies gets amplified and filtered. English does not differentiate words by whether their vowels are nasal or not, but many languages, like French, Polish, and Portuguese, do. English speakers who open their soft palate even when pronouncing sat are said to have a “nasal” voice. When you have a cold and your nose is blocked, opening the soft palate makes no difference, and your voice is the opposite of nasal.
So far we have just discussed the vowels—sounds where the air has clear passage from the larynx to the world. When some barrier is put in the way, one gets a consonant. Pronounce ssssss. The tip of your tongue—the sixth speech organ—is brought up almost against the gum ridge, leaving a small opening. When you force a stream of air through the opening, the air breaks apart turbulently, creating noise. Depending on the size of the opening and the length of the resonant cavities in front of it, the noise will have some of its frequencies louder than others, and the peak and range of frequencies define the sound we hear as s. This noise-making comes from the friction of moving air, so this kind of sound is called a fricative. When rushing air is squeezed between the tongue and palate, we get sh; between the tongue and teeth, th; and between the lower lip and teeth, f. The body of the tongue, or the vocal folds of the larynx, can also be positioned to create turbulence, defining the various “ch” sounds in languages like German, Hebrew, and Arabic (Bach, Chanukah, and so on).
Now pronounce a t. The tip of the tongue gets in the way of the airstream, but this time it does not merely impede the flow; it stops it entirely. When the pressure builds up, you release the tip of the tongue, allowing the air to pop out (flutists use this motion to demarcate musical notes). Other “stop” consonants can be formed by the lips (p), by the body of the tongue pressed against the palate (k), and by the larynx (in the “glottal” consonants in uh-oh). What a listener hears when you produce a stop consonant is the following. First, nothing, as the air is dammed up behind the stoppage: stop consonants are the sounds of silence. Then, a brief burst of noise as the air is released; its frequency depends on the size of the opening and the resonant cavities in front of it. Finally, a smoothly changing resonance, as voicing fades in while the tongue is gliding into the position of whatever vowel comes next. As we shall see, this hop-skip-and-jump makes life miserable for speech engineers.
Finally, pronounce m. Your lips are sealed, just like for p. But this time the air does not back up silently; you can say mmmmm until you are out of breath. That is because you have also opened your soft palate, allowing all of the air to escape through your nose. The voicing sound is now amplified at the resonant frequencies of the nose and of the part of the mouth behind the blockage. Releasing the lips causes a sliding resonance similar in shape to what we heard for the release in p, except without the silence, noise burst, and fade-in. The sound n works similarly to m, except that the blockage is created by the tip of the tongue, the same organ used for d and s. So does the ng in sing, except that the body of the tongue does the job.
Why do we say razzle-dazzle instead of dazzle-razzle? Why super-duper, helter-skelter, harum-scarum, hocus-pocus, willy-nilly, hully-gully, roly-poly, holy moly, herky-jerky, walkie-talkie, namby-pamby, mumbo-jumbo, loosey-goosey, wing-ding, wham-bam, hobnob, razzamatazz, and rub-a-dub-dub? I thought you’d never ask. Consonants differ in “obstruency”—the degree to which they impede the flow of air, ranging from merely making it resonate, to forcing it noisily past an obstruction, to stopping it up altogether. The word beginning with the less obstruent consonant always comes before the word beginning with the more obstruent consonant. Why ask why?
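The ordering rule can be sketched in a few lines of code. This is only a rough illustration under stated assumptions: the four-step obstruency scale, the placement of h and the nasals on it, and the use of a word's first spelled letter as a stand-in for its first consonant are all simplifications introduced here, not part of the account above.

```python
# Assumed obstruency ranks, keyed by the first letter of a word's spelling.
# 0 = barely impedes the air, 3 = stops it altogether (illustrative scale).
OBSTRUENCY = {}
for letters, rank in (("hwyrl", 0), ("mn", 1), ("fvsz", 2), ("pbtdkgj", 3)):
    for letter in letters:
        OBSTRUENCY[letter] = rank


def obstruency(word: str) -> int:
    """Rank a word by the obstruency of its first letter (spelling-based guess)."""
    return OBSTRUENCY[word[0].lower()]


def follows_rule(first: str, second: str) -> bool:
    """True if the less obstruent word comes before the more obstruent one."""
    return obstruency(first) < obstruency(second)


PAIRS = [
    ("razzle", "dazzle"), ("super", "duper"), ("helter", "skelter"),
    ("willy", "nilly"), ("holy", "moly"), ("walkie", "talkie"),
    ("namby", "pamby"), ("mumbo", "jumbo"), ("loosey", "goosey"),
]

for first, second in PAIRS:
    print(f"{first}-{second}: {follows_rule(first, second)}")
```

Every pair in the list comes out in the predicted order, while `follows_rule("dazzle", "razzle")` is false.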
Now that you have completed a guided tour up the vocal tract, you can understand how the vast majority of sounds in the world’s languages are created and heard. The trick is that a speech sound is not a single gesture by a single organ. Every speech sound is a combination of gestures, each exerting its own pattern of sculpting of the sound wave, all executed more or less simultaneously—that is one of the reasons speech can be so rapid. As you may have noticed, a sound can be nasal or not, and produced by the tongue body, the tongue tip, or the lips, in all six possible combinations:
Lips
Nasal (Soft Palate Open): m
Not Nasal (Soft Palate Closed): p
Tongue tip
Nasal (Soft Palate Open): n
Not Nasal (Soft Palate Closed): t
Tongue body
Nasal (Soft Palate Open): ng
Not Nasal (Soft Palate Closed): k
Similarly, voicing combines in all possible ways with the choice of speech organ:
Lips
Voicing (Larynx Hums): b
No Voicing (Larynx Doesn’t Hum): p
Tongue tip
Voicing (Larynx Hums): d
No Voicing (Larynx Doesn’t Hum): t
Tongue body
Voicing (Larynx Hums): g
No Voicing (Larynx Doesn’t Hum): k
Speech sounds thus nicely fill the rows and columns and layers of a multidimensional matrix. First, one of the six speech organs is chosen as the major articulator: the larynx, soft palate, tongue body, tongue tip, tongue root, or lips. Second, a manner of moving that articulator is selected: fricative, stop, or vowel. Third, configurations of the other speech organs can be specified: for the soft palate, nasal or not; for the larynx, voiced or not; for the tongue root, tense or lax; for the lips, rounded or unrounded. Each manner or configuration is a symbol for a set of commands to the speech muscles, and such symbols are called features. To articulate a phoneme, the commands must be executed with precise timing, the most complicated gymnastics we are called upon to perform.
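One way to picture the matrix is as a lookup table keyed by features. The sketch below covers only the nine consonants from the two tables above, with three features each. The names `Features`, `PHONEMES`, and `differing_features` are invented for this illustration, and the nasals are marked as voiced because the larynx keeps humming during m, n, and ng.

```python
from typing import NamedTuple


class Features(NamedTuple):
    """A (hypothetical) bundle of three features for one consonant."""
    articulator: str  # "lips", "tongue tip", or "tongue body"
    nasal: bool       # soft palate open?
    voiced: bool      # larynx humming?


PHONEMES = {
    Features("lips", nasal=True, voiced=True): "m",
    Features("lips", nasal=False, voiced=True): "b",
    Features("lips", nasal=False, voiced=False): "p",
    Features("tongue tip", nasal=True, voiced=True): "n",
    Features("tongue tip", nasal=False, voiced=True): "d",
    Features("tongue tip", nasal=False, voiced=False): "t",
    Features("tongue body", nasal=True, voiced=True): "ng",
    Features("tongue body", nasal=False, voiced=True): "g",
    Features("tongue body", nasal=False, voiced=False): "k",
}

# Reverse index: from a sound back to its feature bundle.
FEATURES_OF = {sound: feats for feats, sound in PHONEMES.items()}


def differing_features(a: str, b: str) -> set[str]:
    """Names of the features on which two sounds disagree."""
    fa, fb = FEATURES_OF[a], FEATURES_OF[b]
    return {name for name in Features._fields if getattr(fa, name) != getattr(fb, name)}


print(PHONEMES[Features("tongue tip", nasal=False, voiced=True)])  # d
print(differing_features("p", "b"))  # {'voiced'}
print(differing_features("m", "b"))  # {'nasal'}
```

Seen this way, p and b are the same cell of the table except for one switch, which is the sense in which a phoneme is a combination of features rather than an indivisible unit.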
English multiplies out enough of these combinations to define 40 phonemes, a bit above the average for the world’s languages. Other languages range from 11 (Polynesian) to 141 (Khoisan or “Bushman”). The total inventory of phonemes across the world numbers in the thousands, but they are all defined as combinations of the six speech organs and their shapes and motions. Other mouth sounds are not used in any language: scraping teeth, clucking the tongue against the floor of the mouth, making raspberries, and squawking like Donald Duck, for instance. Even the unusual Khoisan and Bantu clicks (similar to the sound of tsk-tsk and made famous by the Xhosa pop singer Miriam Makeba) are not miscellaneous phonemes added to those languages. Clicking is a manner-of-articulation feature, like stop or fricative, and it combines with all the other features to define a new layer of rows and columns in the language’s table of phonemes. There are clicks produced by the lips, tongue tip, and tongue body, any of which can be nasalized or not, voiced or not, and so on, as many as 48 click sounds in all!
An inventory of phonemes is one of the things that gives a language its characteristic sound pattern. For example, Japanese is famous for not distinguishing r from l. When I arrived in Japan on November 4, 1992, the linguist Masaaki Yamanashi greeted me with a twinkle and said, “In Japan, we have been very interested in Clinton’s erection.”
We can often recognize a language’s sound pattern even in a speech stream that contains no real words, as with the Swedish chef on The Muppets or John Belushi’s samurai dry cleaner. The linguist Sarah G. Thomason has found that people who claim to be channeling back to past lives or speaking in tongues are really producing gibberish that conforms to a sound pattern vaguely reminiscent of the claimed language. For example, one hypnotized channeler, who claimed to be a nineteenth-century Bulgarian talking to her mother about soldiers laying waste to the countryside, produced generic pseudo-Slavic gobbledygook like this:
Ovishta reshta rovishta. Vishna beretishti? Ushna barishta dashto. Na darishnoshto. Korapshnoshashit darishtoy. Aobashni bedetpa.
And of course, when the words in one language are pronounced with the sound pattern of another, we call it a foreign accent, as in the following excerpt from a fractured fairy tale by Bob Belviso:
GIACCHE ENNE BINNESTAUCCHE
Uans appona taim uase disse boi. Neimmese Giacche. Naise boi. Live uite ise mamma. Mainde da cao.
Uane dei, di spaghetti ise olle ronne aute. Dei goine feinte fromme no fudde. Mamma soi orais, “Oreie Giacche, teicche da cao enne traide erra forre bocchese spaghetti enne somme uaine.”
Bai enne bai commese omme Giacche. I garra no fudde, i garra no uaine. Meichese misteicche, enne traidese da cao forre bonce binnese.
Giacchasse!
What defines the sound pattern of a language? It must be more than just an inventory of phonemes. Consider the following words:
ptak
plaft
vlas
rtut
thale
sram
flutch
toasp
hlad
mgla
dnom
nyip
All of the phonemes are found in English, but any native speaker recognizes that thale, plaft, and flutch are not English words but could be, whereas the remaining ones are not English words and could not be. Speakers must have tacit knowledge about how phonemes are strung together in their language.
Phonemes are not assembled into words as one-dimensional left-to-right strings. Like words and phrases, they are grouped into units, which are then grouped into bigger units, and so on, defining a tree. The group of consonants (C) at the beginning of a syllable is called an onset; the vowel (V) and any consonants coming after it are called the rime:
The rules generating syllables define legal and illegal kinds of words in a language. In English an onset can consist of a cluster of consonants, like flit, thrive, and spring, as long as they follow certain restrictions. (For example, vlit and sring are impossible.) A rime can consist of a vowel followed by a consonant or certain clusters of consonants, as in toast, lift, and sixths. In Japanese, in contrast, an onset can have only a single consonant and a rime must be a bare vowel; hence strawberry ice cream is translated as sutoroberi aisukurimo, girlfriend garufurendo. Italian allows some clusters of consonants in an onset but no consonants at the end of a rime. Belviso used this constraint to simulate the sound pattern of Italian in the Giacche story; and becomes enne, from becomes fromme, beans becomes binnese.
Onsets and rimes not only define the possible sounds of a language; they are the pieces of word-sound that are most salient to people, and thus are the units that get manipulated in poetry and word games. Words that rhyme share a rime; words that alliterate share an onset (or just an initial consonant). Pig Latin, eggy-peggy, aygo-paygo, and other secret languages of children tend to splice words at onset-rime boundaries, as does the Yinglish construction in fancy-shmancy and Oedipus-Shmoedipus. In the 1964 hit song “The Name Game” (“Noam Noam Bo-Boam, Bonana Fana Fo-Foam, Fee Fi Mo Moam, Noam”), Shirley Ellis could have saved several lines in the stanza explaining the rules if she had simply referred to onsets and rimes.
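The splice at the onset boundary is easy to sketch in code. The version below works on spelling rather than sound, treating the letters a, e, i, o, and u as vowels, which is only an approximation and an assumption of this example. The function names are invented for illustration, and the handling of vowel-initial words in Pig Latin varies from one playground to the next.

```python
VOWEL_LETTERS = set("aeiou")  # spelling-based stand-in for vowel sounds


def split_first_onset(word: str) -> tuple[str, str]:
    """Split a word into its initial consonant cluster and everything after it."""
    for i, letter in enumerate(word.lower()):
        if letter in VOWEL_LETTERS:
            return word[:i], word[i:]
    return word, ""  # no vowel letter found: treat the whole word as onset


def pig_latin(word: str) -> str:
    """Move the first onset to the end of the word and add 'ay'."""
    onset, rest = split_first_onset(word)
    return rest + onset + "ay"


def shm_reduplicate(word: str) -> str:
    """Repeat the word with its first onset replaced by 'shm'."""
    _, rest = split_first_onset(word)
    return f"{word}-shm{rest}"


print(split_first_onset("spring"))  # ('spr', 'ing')
print(pig_latin("flit"))            # itflay
print(shm_reduplicate("fancy"))     # fancy-shmancy
```

Both games call the same splitting function, which is the point: the onset-rime boundary is the seam along which children and satirists alike cut words apart.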
Syllables, in turn, are collected into rhythmic groups called feet:
Syllables and feet are classified as strong (s) and weak (w) by other rules, and the pattern of weak and strong branches determines how much stress each syllable will be given when it is pronounced. Feet, like onsets and rimes, are salient chunks of word that we tend to manipulate in poetry and wordplay. Meter is defined by the kind of feet that go into a line. A succession of feet with a strong-weak pattern is a trochaic meter, as in Mary had a little lamb; a succession with a weak-strong pattern is iambic, as in The rain in Spain falls mainly in the plain. An argot popular among young ruffians contains forms like fan-fuckin-tastic, abso-bloody-lutely, Phila-fuckin-delphia, and Kalama-fuckin-zoo. Ordinarily, expletives appear in front of an emphatically stressed word; Dorothy Parker once replied to a question about why she had not been at the symphony lately by saying “I’ve been too fucking busy and vice versa.” But in this lingo they are placed inside a single word, always in front of a stressed foot. The rule is followed religiously: Philadel-fuckin-phia would get you launched out of the pool hall.
The assemblies of phonemes in the morphemes and words stored in memory undergo a series of adjustments before they are actually articulated as sounds, and these adjustments give further definition to the sound pattern of a language. Say the words pat and pad. Now add the inflection -ing and pronounce them again: patting, padding. In many dialects of English they are now pronounced identically; the original difference between the t and the d has been obliterated. What obliterated them is a phonological rule called flapping: if a stop consonant produced with the tip of the tongue appears between two vowels, the consonant is pronounced by flicking the tongue against the gum ridge, rather than keeping it there long enough for air pressure to build up. Rules like flapping apply not only when two morphemes are joined, like pat and -ing; they also apply to one-piece words. For many English speakers ladder and latter, though they “feel” like they are made out of different sounds and indeed are represented differently in the mental dictionary, are pronounced the same (except in artificially exaggerated speech). Thus when cows come up in conversation, often some wag will speak of an udder mystery, an udder success, and so on.
Interestingly, phonological rules apply in an ordered sequence, as if words were manufactured on an assembly line. Pronounce write and ride. In most dialects of English, the vowels differ in some way. At the very least, the i in ride is longer than the i in write. In some dialects, like the Canadian English of newscaster Peter Jennings, hockey star Wayne Gretzky, and yours truly (an accent satirized a few years back, eh, in the television characters Bob and Doug McKenzie), the vowels are completely different: ride contains a diphthong gliding from the vowel in hot to the vowel ee; write contains a diphthong gliding from the higher vowel in but to ee. But regardless of exactly how the vowel is altered, it is altered in a consistent pattern: there are no words with long/low i followed by t, nor with short/high i followed by d. Using the same logic that allowed Lois Lane in her rare lucid moments to deduce that Clark Kent and Superman were the same, namely that they are never in the same place at the same time, we can infer that there is a single i in the mental dictionary, which is altered by a rule before being pronounced, depending on whether it appears in the company of t or d. We can even guess that the initial form stored in memory is like the one in ride, and that write is the product of the rule, rather than vice versa. The evidence is that when there is no t or d after the i, as in rye, and thus no rule disguising the underlying form, it is the vowel in ride that we hear.
Now pronounce writing and riding. The t and d have been made identical by the flapping rule. But the two i’s are still different. How can that be? It is only the difference between t and d that causes a difference between the two i’s, and that difference has been erased by the flapping rule. This shows that the rule that alters i must have applied before the flapping rule, while t and d were still distinct. In other words, the two rules apply in a fixed order, vowel-change before flapping. Presumably the ordering comes about because the flapping rule is in some sense there to make articulation easier and thus is farther downstream in the chain of processing from brain to tongue.
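The argument from ordering can be checked mechanically. In the sketch below each word is a list of segments, with "ay" standing for the stored vowel of ride, "Ay" for the altered vowel of write, and "D" for the flap. These symbols, the function names, and the decision to model only t as a trigger for the vowel change are assumptions made for the illustration.

```python
VOWELS = {"ay", "Ay", "i"}  # segments counted as vowels in this toy model
CHANGE_TRIGGERS = {"t"}      # only t is modeled as triggering the vowel change


def change_vowel(segments: list[str]) -> list[str]:
    """Vowel-change rule: 'ay' becomes 'Ay' when the next segment is t."""
    out = list(segments)
    for k in range(len(out) - 1):
        if out[k] == "ay" and out[k + 1] in CHANGE_TRIGGERS:
            out[k] = "Ay"
    return out


def flap(segments: list[str]) -> list[str]:
    """Flapping rule: t or d between two vowels becomes the flap 'D'."""
    out = list(segments)
    for k in range(1, len(out) - 1):
        if out[k] in {"t", "d"} and out[k - 1] in VOWELS and out[k + 1] in VOWELS:
            out[k] = "D"
    return out


WRITING = ["r", "ay", "t", "i", "ng"]
RIDING = ["r", "ay", "d", "i", "ng"]

# Vowel change first, then flapping: the vowels stay distinct.
print(flap(change_vowel(WRITING)))  # ['r', 'Ay', 'D', 'i', 'ng']
print(flap(change_vowel(RIDING)))   # ['r', 'ay', 'D', 'i', 'ng']

# Flapping first, then vowel change: the two words collapse together.
print(change_vowel(flap(WRITING)))  # ['r', 'ay', 'D', 'i', 'ng']
print(change_vowel(flap(RIDING)))   # ['r', 'ay', 'D', 'i', 'ng']
```

Only the first ordering reproduces what speakers actually say: identical consonants, different vowels. In the second, the flap has already erased the t by the time the vowel rule looks for it.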
Notice another important feature of the vowel-altering rule. The vowel i is altered in front of many different consonants, not just t. Compare: