Kanzi: The Ape at the Brink of the Human Mind

by Sue Savage-Rumbaugh


  Of course, there also must have been changes in the neuroanatomical systems that controlled these structures. In addition to the proper anatomical design, speech production requires extremely precise and coordinated control of many muscles. Moreover, speech is so rapid that we cannot possibly be producing each sound individually. We are, instead, coarticulating, which means that our mouths have already assumed the shape for the next sound to be made before we have finished producing the first sound. Since speech is infinitely variable, the coarticulation process is never the same from one word to the next unless one repeats oneself. This means that although words sound the same to us when we hear them from one time to the next, they are not really being said in the same manner. They are altered as a function of the speech context in which they occur. It is for this reason that it is so difficult to build a device that interprets speech: the patterns vary endlessly, and currently only the human ear can readily find the meaningful units within them. The consonants permit us to accomplish this feat.

  Why are consonants important? Couldn’t apes and monkeys simply use the sounds that they can make to construct a language all of their own? The issue is more complicated than it seems. Studies of the vocal repertoires of chimpanzees reveal that they, like many other mammals, possess a “graded” system of vocal communication. This means that instead of producing distinct calls that can easily be distinguished from each other, they produce a set of sounds that grade into each other with no clear boundaries. In one sense, a graded system permits richer communication than a system with fixed calls, which is what characterizes many bird species, for example. The graded vocal system of chimpanzees permits them to utilize pitch, intensity, and duration to add specific affective information to their vocal signals. For example, food calls signal the degree of pleasure felt about the food, as well as the interest in food per se.

  More important, these affectively loaded signals are exchanged rapidly back and forth and the parameters of pitch, intensity, and duration serve communicative functions very similar to human speech. We can say a phrase like, “Oh, I am very happy” with such feeling that the happiness almost leaps out of the speaker, or with a cynicism that lets the listener know the speaker is not really happy at all. Graded systems are well designed for transmitting emotional information that is itself graded in content. A feeling of happiness is something like a color in its endless variations. However, a word such as “fruit” or “nut” does not lend itself to a graded system. Words are units of specific information, and while they may themselves generate affect, they are not dependent on the affect for their information-bearing qualities. Consequently, unlike affective signals that constantly intergrade, words have a definite beginning and ending.

  If apes are indeed intelligent enough to do so, why have they not elaborated their graded system into one with units, as we have? Unfortunately, vowels are ill-equipped to permit such “packed” communication. Even in human speech, vowels grade into one another, making it impossible to determine where one starts and ends. When tested with a computer-generated sound that slowly transitions from the vowel sound “Ah” to that of “Eh,” humans exhibit a sort of “fuzzy boundary.” There is a large area of the transition space that we label either as “Ah” or “Eh” without much consistency. This is true across different listeners and for the same listener on different trials.
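
  The sketch below shows one way such a continuum could be generated. It is a minimal Python example, not anything taken from the studies described here: a pulse-train source is shaped by two resonant filters whose center frequencies are interpolated between rough textbook formant values for the two vowels. The formant numbers, bandwidths, pitch, and file names are all illustrative assumptions.

```python
# A minimal sketch of a computer-generated "Ah"-to-"Eh" continuum.
# All numbers and file names are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import lfilter

FS = 16000            # sample rate (Hz)
DUR = 0.4             # duration of each stimulus (seconds)
F0 = 120              # pitch of the pulse-train "voice" (Hz)

AH = (730.0, 1090.0)  # rough first and second formants for "Ah"
EH = (530.0, 1840.0)  # rough first and second formants for "Eh"

def glottal_source(dur, f0, fs=FS):
    """Impulse train standing in for the vibrating vocal folds."""
    samples = np.zeros(int(dur * fs))
    samples[::int(fs / f0)] = 1.0
    return samples

def resonator(signal, freq, bandwidth, fs=FS):
    """Second-order all-pole filter approximating one formant resonance."""
    r = np.exp(-np.pi * bandwidth / fs)
    a = [1.0, -2.0 * r * np.cos(2.0 * np.pi * freq / fs), r * r]
    return lfilter([1.0 - r], a, signal)

steps = 10
for i in range(steps):
    t = i / (steps - 1)                    # 0 = pure "Ah", 1 = pure "Eh"
    f1 = (1 - t) * AH[0] + t * EH[0]       # interpolate the first formant
    f2 = (1 - t) * AH[1] + t * EH[1]       # interpolate the second formant
    sound = resonator(resonator(glottal_source(DUR, F0), f1, 90.0), f2, 110.0)
    sound = sound / np.max(np.abs(sound))  # normalize to the range -1..1
    wavfile.write(f"ah_eh_step_{i:02d}.wav", FS, (sound * 32767).astype(np.int16))
```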

  It is consonants that permit us to package vowels and therefore produce a speech stream that can be readily segmented into distinct auditory units, or word packages. Here we experience what is called a “categorical shift.” If the computer presents us with consonants rather than vowels, as in a test where the sound slowly changes from “Ba” to “Pa,” we continue to hear “Ba” until all of a sudden it sounds as though the computer decided to switch to “Pa.” Although the computer has indeed presented us with a gray area of transition, just as it did when it played the vowels, with consonants we simply do not notice it. It is as though we have an auditory system equipped with filters designed to let us hear either “Ba” or “Pa,” but nothing in between. When we hear a “Ba,” it either fits the “Ba” filter parameters or it does not. If it does not fit, we cannot make a judgment about it, as we do with a vowel, because we simply don’t hear it as some mixture of “Ba + Pa” to judge.
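
  One way to picture this contrast is to plot two hypothetical identification curves over a ten-step continuum, one shallow and one abrupt. The sketch below does exactly that; the logistic curves and their slopes are schematic assumptions, not data from the experiments described here.

```python
# Schematic illustration (not experimental data): a shallow labeling curve
# for vowels versus an abrupt, near step-like curve for consonants.
import numpy as np
import matplotlib.pyplot as plt

continuum = np.linspace(0, 9, 200)   # position along a ten-step continuum
boundary = 4.5                       # assumed midpoint of the continuum

def labeled_as_second(x, slope):
    """Proportion of trials labeled as the second sound (a logistic curve)."""
    return 1.0 / (1.0 + np.exp(-slope * (x - boundary)))

plt.plot(continuum, labeled_as_second(continuum, 0.6),
         label='vowels ("Ah" to "Eh"): fuzzy boundary')
plt.plot(continuum, labeled_as_second(continuum, 6.0),
         label='consonants ("Ba" to "Pa"): categorical shift')
plt.xlabel("step along the continuum")
plt.ylabel("proportion labeled as the second sound")
plt.legend()
plt.show()
```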

  For some time after this phenomenon of “categorical shift” was discovered, scientists thought that humans alone among mammals possessed the ability to process speech sounds categorically. Moreover, it was widely accepted that this capacity was a genetically predetermined aspect of our auditory system. Even though many scientists recognized that animals could learn to respond to single-word spoken commands, it was assumed that they were doing so on the basis of intonational contours, rather than the phonemic units themselves.

  This view held sway until a method was devised to ask animals what they heard as they listened to consonants and vowels that graded into one another. The techniques used in these tests were modeled after those that had been applied to human infants, who had been asked a similar question about their categorical skills. Human infants proved able to categorize consonants in a manner similar to that of adults, a fact that was initially viewed as strong support for the belief that these capacities were genetically programmed into our auditory systems. However, tests with mammals as different as chinchillas and rhesus monkeys revealed clearly that man was not unique in the capacity to make categorical judgments about consonants. Other animals could form acoustic boundaries that categorically differentiated consonants, even though they employed no such sounds in their own vocal systems. Thus, speech sounds are unique to humans only with regard to our ability to produce them, not with regard to our ability to hear them. On reflection, it seems odd that it should have surprised us that auditory systems are capable of far greater sound definition than the organism is able to produce with its vocal cords. After all, we live in a very noisy environment, and to get along in the forest we certainly need to be able to discriminate and make sense out of many sounds that we ourselves cannot produce.

  Consonants are rather “funny” sounds. They must, for example, always be linked to a vowel if they are to be heard as consonants. We cannot separate vowels and consonants in normal speech because it is impossible for humans to say a consonant without also saying a vowel. Thus we cannot utter “G,” but rather must say “Gee” or “Ga” or “Ghuh” or some similar sound. However, it is possible, with the aid of a computer, to chop apart vowels and consonants and thus have the computer say “G” in a way that we cannot. To accomplish this, you need only record some speech into a computer using a program that can transform auditory information into visual information. Once you have a picture of the sound on the screen, you can play it back and watch as a time pointer moves through the wave form while you listen to Ga, or any other sound you have recorded. As the pointer moves, you can determine the point at which you no longer hear the “G” but are instead listening to the “ah.” If you cut the word at this point and play the two halves, you will find something astonishing. The “ah” sounds like a normal “ah,” but the “G” is not recognizable at all. It sounds like some sort of clicking, hissing noise and you will think that somehow the computer has made a mistake. But you have only to paste this hissing, clicking noise back onto the “ah” sound to hear the “G” sound again, as clear as can be.
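
  The sketch below walks through this cut-and-paste exercise in Python: load a recording, look at the wave form to find where the brief burst gives way to the periodic vowel, and write the two halves to separate files. The file name “ga.wav” and the split time are hypothetical placeholders; any short recording of “Ga” will do.

```python
# A minimal sketch of the splitting exercise described above.
# The file name and split time are hypothetical placeholders.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

fs, data = wavfile.read("ga.wav")      # hypothetical recording of "Ga"
if data.ndim > 1:
    data = data[:, 0]                  # keep a single channel if stereo

# Display the wave form so the eye can find the burst-to-vowel transition.
times = np.arange(len(data)) / fs
plt.plot(times, data)
plt.xlabel("time (seconds)")
plt.ylabel("amplitude")
plt.show()

split_at = 0.06                        # example split point, in seconds
cut = int(split_at * fs)

wavfile.write("g_only.wav", fs, data[:cut])    # the clicking, hissing burst alone
wavfile.write("ah_only.wav", fs, data[cut:])   # the vowel alone
```

  Concatenating the two pieces again (for example, with np.concatenate) and writing them out as a single file restores the clear “G” described above.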

  What does this tell us? Variations in our perceptions are more a property of the auditory and neurological systems that we listen with than of the sound pattern itself. Sharp, short sounds like clicks and hisses are perceived differently from longer, tonal sounds. Why would this be? It may be the result of another unusual fact about clicks and hisses: We can localize them extremely well in space, a skill we probably owe to the fact that a broken branch or disturbed leaf can signal the approach of a predator. Most mammals, including ourselves, need to be able to turn in the right direction quickly and respond without hesitation when such a sound portends danger in the forest. By contrast, longer vowel-like sounds are produced by most mammalian and avian vocal tracts and are used for communicative purposes, not for hunting. We cannot localize such sounds as well. When animals hunt they are quiet, and clicks and scraping noises as they move through the forest are the only clue to their presence. Thus it seems that auditory systems have evolved different ways of listening to different sorts of sounds.

  The fact that clicks and hissing sounds are so distinct and easily perceived gives them unusual properties when they are linked to vowel-like tonal sounds. The merging of these two sound types results in what we hear as consonants. Without consonants it is doubtful that we would have spoken language. Why not? The answer is that vowel sounds are difficult to tell apart. It is hard to determine when an “ee” sound turns into an “ii” sound. At the extremes, you can determine which vowel is being produced, but this ability fails us rapidly as one vowel sound begins to grade into another.

  The same is not true of short sounds like hisses and clicks. We hear them as discrete staccatolike events well localized in space. When these clicks are merged with vowels, consonants appear and act as the boundaries around vowels that permit us to determine readily where one syllable starts and another stops—so that we hear words as individual units.

  It is startling to learn that these things we call words, which we hear as such distinct entities, are really not distinct at all. When we look at a visual wave form of a sentence, we find that the distinctions between words vanish completely. If we pause in our speaking for a break or for emphasis, we see a break in the wave form, but for regular speech, a sentence looks like one continuous word. Thus the units that we hear are not present in the physical energy we generate as we talk. We hear the sound spectrum of speech as segmented into words only because the consonants allow our brains to break the lump down at just the right joints—the joints we call words.

  Seen from this perspective, the fascinating thing about human language becomes our ability to produce the actual units of speech. If we did not have the ability to attach clicks to vowels, we could not make consonants. Without consonants, it would be difficult to create a spoken language that could be understood, regardless of how intelligent we were.

  It seems odd that the human animal is the only one that has gained the ability to produce consonants. Of course, it is also the case that we are the only animal that is a habitual biped, and the demands of bipedality have placed some rather important constraints upon our skull. These changes came at a price: our small teeth could no longer serve as weapons, and the sharp bend in our throats left us forever prone to choking. But the ability to form consonants readily gave Homo a way to package vowel sounds in many different envelopes, making possible a multitude of discriminable sounds. For the first time in primate evolutionary history, it became physically possible for us to invent a language. I suspect that our intellect had the potential for language long before, but it took the serendipitous physical changes that accompanied bipedalism to permit us to package vowels and consonants together in a way that made possible the open-ended generation of discriminable sound units—the crucial step leading to speech around the world.

  These unusual properties of the auditory system are paralleled by similar phenomena in the visual system. Suppose we look at a row of marquee lights flashing off and on. If the time between the flashing of light A and light B is brief enough, we will perceive the lights as a single moving piece of energy. That is, we will not see any breaks or holes in the movement; our brain will fill in the gaps. If the light is slowed down, however, we will perceive it as jumping from one marquee bulb to another, with gaps in between. Thus, at one speed we see only a moving light; at another, we see a jumping light. This visual phenomenon is, like the categorical shift phenomenon, a property of the visual system of many primates as well.

  Given the perceptual constraints of the auditory system, it is evident that the appearance of language awaited the development of a vocal system capable of packaging vowel sounds with consonants. Regardless of brain size, if the vocal system of an organism cannot produce consonants, language is not likely to emerge. The majority of land-dwelling mammals are quadrupedal and consequently have retained the sloping vocal tract designed to modify vowels, to convey affect, and to enable them to swallow easily without choking. The shape of this tract makes rapid consonant-vowel transitions physically implausible, even if the neural circuitry were to permit the ape to attempt them.

  What of early hominids—would their vocal tracts have permitted them to produce consonants and thus package their vowels into discriminable units of sound? Edmund Crelin has constructed model vocal tracts for Australopithecus, Homo erectus, Neanderthals, and other archaic Homo sapiens. While the reconstruction of soft tissue is always difficult, and the testing of a rubber mold is subject to numerous subtle variations, such tests are nonetheless currently the best way to approximate the speech capacities of extinct species. Crelin concluded that the ability to produce vowel-like sounds typical of modern speech would not have appeared until the advent of archaic Homo sapiens, around two hundred and fifty thousand years ago. These creatures had a brain capacity similar to our own.

  If Crelin is correct, then language cannot have been responsible for the creation of Homo sapiens. Rather, it appears that gaining the vocal tract that made language possible may simply have been a free benefit as we evolved into better bipeds. How we achieved the fine neuroanatomical control required to orchestrate the coarticulatory movements and the voluntary respiratory control to operate our vocal tract, however, remains something of a mystery.

  At the Wenner-Gren conference several people felt uncomfortable with Kanzi’s command of language comprehension. They were able to take satisfaction, however, when I acknowledged that it would probably be more difficult to teach Kanzi to tap dance than to use a keyboard-based language system. They were relieved because they felt there might be a link between the highly developed motor skill that we use to tap dance and the similar motor-planning routines required by speech. Moreover, the idea of increased motor skill and planning as an evolutionary engine interested those, such as Patricia Greenfield, who saw evolutionary links between the development of tool-use skills and language, links that suggest a common neurological substrate for both.

  Evidence for the putative neurological link between tool-use skills and language includes the fact that certain brain areas, such as the inferior parietal association area, are involved in both object manipulation and language, in this case object naming and grammar. This kind of overlap is clearly seen in certain patients who have suffered damage to their Broca’s area. Depending on the nature of the damage, such people may be unable to construct grammatical sentences. In other words, they are unable to assemble words in a hierarchical manner. These same people are also unable to construct simple hierarchical patterns using short sticks and a model to copy. In experiments with young children, Patricia tracked the emergence of the hierarchical concept with age. She asked the children to copy a symmetrical model, again building a two-dimensional structure using short sticks. Although children aged seven and older built the structure along hierarchical lines, younger children did it piecemeal, focusing on one local area of the structure at a time.

  The implication of this and other evidence is that, rather than arising in neurological isolation, as the Chomskian position argues, language abilities are intimately related early in life to abilities of object manipulation. “The ontogenesis of a tool-use program relies on Broca’s area in the left hemisphere of the brain, just as early word formation does,” observes Patricia. “This is the key point in relation to tools and language: they have a common neural substrate in their early ontogenetic development… . These programs differentiate from age two on, when Broca’s area establishes differentiated circuits with the anterior prefrontal cortex.”4

  If this is indeed the case, then it is legitimate for archeologists to look for evidence of linguistic abilities encoded in stone-tool assemblages. Changes in the cognitive sophistication of tool assemblages should—in some way—also reflect changes in linguistic capacity. The problem is, there is no objective method for assessing the degree of cognitive sophistication embodied in mute stones, as I learned from the disagreements voiced among the archeologists present at the Wenner-Gren conference in Portugal. When, almost two decades ago, Glynn Isaac tackled the challenge of looking for signs of language in tool technology, he first looked at the overall picture from two and a half million years ago to a little more than thirty thousand years ago. This perspective on the trajectory of language evolution through the past two and a half million years led him to rather different conclusions from those derived from the anatomical evidence produced by the reconstruction of vocal tracts and observations of brain expansion and organization. Isaac concluded that the initial stage of stone-tool technology, the Oldowan, produced between two and a half and one and a half million years ago, implied “designing and symbolizing capabilities … not necessarily vastly beyond that of contemporary [apes].”5

  Tom Wynn and Nick Toth have looked at the same evidence more recently, with somewhat different results. As I indicated in Chapter 8, Tom believes the Oldowan tool technology was essentially within the cognitive reach of an ape. “In its general features Oldowan culture was ape, not human,” he concluded. “Nowhere in this picture need we posit elements such as language… .”6 Nick, on the other hand, states that the “Oldowan tool makers were not just bipedal chimpanzees.”7 Nick bases his conclusions on the fact that the earliest tool-makers apparently had mastered the principles of conchoidal fracture, that is, the searching out of appropriate angles for striking platforms and the delivering of appropriately angled blows with a hammerstone. This mastery, says Nick, implies a cognitive competence beyond that of apes, and may be taken as evidence of some linguistic ability.
