Passing the buck
It might be tempting, at this point, to sit back and think that psycholinguists now know how babies come to recognize words. Certainly, the evidence is sufficient for us to begin to understand how the process gets under way. Babies are clever, and know quite a bit even by the time they are born. But whereas we may now understand better how the baby comes to develop some sort of knowledge about syllables, and hence words, we have simply attributed the acquisition of this knowledge to the acquisition of some other, perhaps more fundamental knowledge concerning prosody. And how did that knowledge develop? It is not enough to say that it developed simply because the information was there in utero. Why is it that, just by hearing a bunch of sounds, the baby should learn about the structure of those sounds?
A theory that says babies learn everything they need to know about their language after they are born is no different, really, from a theory that says they learn some of it before they are born. Even if they do not learn anything prenatally, and instead they learn about the prosodic characteristics of the language postnatally, the problem is still the same; how do they learn, and why do they learn what they do? After all, put a computer with a microphone in a roomful of people, and unless it contains a relevant program (and how did that get there?), it will not do much at all, let alone learn about the language that the people speak.
Psycholinguistics itself does not have a ready answer to these fundamental questions. Fortunately, what look like the beginnings of some answers have developed in the field of computational modelling: using computers to mimic the learning process. In Chapter 13 we shall explore some of the advances made in this field, and the ways in which they can shed light on some of the fundamental questions that remain to be answered.
We have failed so far to consider a question which bears fundamentally on all these issues: is there something about our ability to perceive speech that is uniquely human? Do we have some sort of genetic predisposition to understand speech in a certain way? The next chapter considers this question in the context of our ability to distinguish not between syllables, but between the sounds that they are composed of.
Chinchillas do it too
In a review of infant speech perception work written in the early 1980s, Peter Jusczyk suggested that instead of trying to find out what babies can do, we should instead try and find out what adults can do that babies cannot do. One big difference between adults and babies is that adults have acquired knowledge about individual words; they have a mental dictionary, or lexicon. And obviously, not having any experience of the specific words in the language will severely limit the babies' abilities. But are there any other limitations on babies' abilities that have nothing to do with this lack of a lexicon? At the time of Jusczyk's review, it was beginning to look as if, in terms of speech perception abilities that do not require a lexicon, babies could do most things that adults could do. This caused a number of researchers to wonder whether this remarkable ability might not be due to innate knowledge about the language. Even before it was discovered that newborns seemed to be able to organize what they heard in terms of syllables, a much more fundamental ability had been discovered which, it seemed, could only be explained by assuming a genetic component. So fundamental is this ability that we take it even more for granted than probably any other feat of human speech perception; it is simply the ability to tell one sound apart from another.
The fact that babies can distinguish between [p] and [t] is surely uninteresting: is this ability any more remarkable than the ability to tell colours apart, or to see different shades of grey? Being able to discriminate colour is a property of our visual perceptual system that occurs because we have cells in the retina of the eye which are sensitive to different wavelengths of light. This does not mean that we have innate knowledge of colour, but it does mean that there has probably been some evolutionary pressure that has selected for, and favoured, mutations which enabled organisms to discriminate between different kinds of light. So why is the ability to discriminate between [p] and [t], or [p] and [b], any different? What is there about this ability that could possibly be interesting?
These sounds, or phonemes, differ from one another in very subtle ways. Phonemes are the smallest parts of spoken words which, if changed, can lead to the creation of new words. So 'bat' and 'pat' differ by just one phoneme. As do 'speech' and 'peach'. But it is the subtlety of the differences that makes it all the more remarkable that we can tell phonemes apart.
The phonemes /b/ and /p/ are almost identical; both are produced at the lips (they start closed and finish open), but in producing the first, the vocal folds vibrate, and in producing the second, they do not. So the sounds [ba] and [pa] differ in that the vocal folds start to vibrate sooner in [ba] than in [pa]. The difference is small; in [pa] the vocal folds start to vibrate around 40 milliseconds after the lips open, whereas in [ba] they start to vibrate within around 20 milliseconds after the lips open. This difference in the onset of vibration of the vocal folds is normally referred to as a difference in the voice onset time.
So far, nothing has been said that makes it surprising that either adults, or infants, can tell these different sounds apart; all we need to suppose is that our perceptual system is sufficiently finely tuned that we can tell that there are tiny differences between the sounds. But it is not quite so simple. First of all, what we think we hear is not necessarily what we have actually heard. In a now classic demonstration of this, Harry McGurk, whose interests in social issues and child development led him to become Director of the Australian Institute of Family Studies, and a student of his, Janet MacDonald, showed people a video recording of a speaker saying the sound [ba]. However, they replaced the original sound with the sound [ga]. So what people saw in this experiment was a lip movement compatible with [ba], but what they were played was the sound [ga]. Would they report hearing [ba] or [ga]? In fact, they reported neither. What they said they heard was [da], a sound which is midway between [ba] and [ga]. In effect, people experiencing the 'McGurk effect' take the two sources of evidence, visual and auditory, and combine them to perceive an illusory sound that is between the originals.
Of course, most of what we perceive is determined by the sounds entering the ear (even the illusory sounds in the McGurk effect are determined, in part, by the actual sounds played on the video). But the McGurk effect demonstrates that on occasion we hear something that is not in fact in the speech signal. This can be contrasted with occasions when we fail to hear something that is. It is this failure that allows us to recognize speech as effortlessly as we do.
Different sounds, same sensation
Different versions of /b/ may have different delays between the opening of the lips and the onset of vocal fold vibration (the voice onset time). This difference in voice onset times between the two versions can be as great as the difference between a /b/ and a /p/, or between two versions of /p/. Despite this, we do not think of the two versions of the /b/ as being any different. More importantly, as Al Liberman and his colleagues first discovered in the late 1950s at the Haskins Laboratories in Connecticut, we cannot even perceive the two versions of the /b/ as being any different. As we shall see, this finding has some puzzling, and important, consequences.
The effects studied by Liberman involved the creation of artificial versions of /b/ and /p/ which differed in voice onset time by varying degrees. In fact, a whole continuum of sounds can be created, with sounds at one extreme having a voice onset time of zero and sounds at the other extreme having a voice onset time of, for instance, 60 milliseconds. When sounds taken from this continuum are played to people, they can easily say which sounds belong to the /b/ category, and which belong to the /p/ category. Generally, anything with a voice onset time less than 20 milliseconds is classified as a /b/, and anything with a voice onset time greater than around 40 milliseconds is classified as a /p/. The important part of the experiment is that if pairs of sounds are presented one after the other, and people are asked to say whether the two sounds are at all different, they can only spot a difference if the sounds are from either side of the boundary. If the pairs are from the same side of the boundary (i.e. both are between 0 and 20 milliseconds or between 40 and 60 milliseconds), they cannot tell that there is any difference in the sounds. In fact, there can be a bigger difference in voice onset time between sounds taken from the same side of the boundary, which cannot be distinguished, than between sounds that straddle the boundary and which can be told apart. This phenomenon is called categorical perception; only speech sounds that are perceived as belonging to different phoneme categories can be discriminated.
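The logic of categorical perception can be put in a few lines of code. What follows is only a toy sketch, not a model of the experiments: it assumes a single sharp boundary at 30 milliseconds (an illustrative midpoint between the values of around 20 and 40 milliseconds mentioned above), and the names BOUNDARY_MS, classify and can_discriminate are invented for the illustration.

```python
# Toy sketch of categorical perception along a voice onset time (VOT) continuum.
# ASSUMPTION: one sharp boundary at 30 ms, chosen as an illustrative midpoint
# between the roughly 20 ms and 40 ms values given in the text.
BOUNDARY_MS = 30.0


def classify(vot_ms: float) -> str:
    """Label a sound as 'b' or 'p' according to which side of the boundary it falls."""
    return "b" if vot_ms < BOUNDARY_MS else "p"


def can_discriminate(vot_a_ms: float, vot_b_ms: float) -> bool:
    """Two sounds are told apart only if they receive different labels."""
    return classify(vot_a_ms) != classify(vot_b_ms)


# Same side of the boundary, 25 ms apart: not told apart.
print(can_discriminate(0.0, 25.0))   # False
# Opposite sides of the boundary, only 10 ms apart: told apart.
print(can_discriminate(25.0, 35.0))  # True
```

The point of the sketch is that the size of the physical difference plays no part in the decision; all that matters is whether the two labels differ.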
This result has been replicated many times, and with different phonemes; including, for instance, a continuum of phonemes with /p/ at one end, /t/ towards the middle, and /k/ at the other end, as well as the continuum /b/-/d/-/g/. There is no vocal fold vibration in the /p/-/t/-/k/ continuum whereas there is in the /b/-/d/-/g/ continuum. In other respects, the two continua are identical. The different sounds along each continuum do not differ in terms of voice onset time, but in terms of subtle differences in the changing frequencies at the beginnings of each sound. These differences reflect the different positions at which the mouth is closed off when the phonemes are uttered.
Again, nothing has been said that is too surprising; we need simply assume that adults have somehow learned which sounds are relevant to their language and have learned to ignore differences in sounds that are irrelevant. So in English, phonemes with voice onset times less than around 20 milliseconds are /b/s and phonemes with voice onset times more than around 40 milliseconds are /p/s, and any other differences are irrelevant; that is, any other differences do not discriminate between different words in the language. Whereas [bat] and [pat] are different words, [bat] with voice onset time of 0 milliseconds and [bat] with voice onset time of 20 milliseconds are both 'bat'. This view, based on a notion of relevance, is corroborated by the finding that the speakers (and hence hearers) of different languages may find different differences relevant. For instance, English speakers have only two categories (that is, one boundary) along the voice onset time continuum. Thai speakers, on the other hand, have three categories (and therefore two boundaries). Similarly, Japanese speakers do not have a clear boundary at all in the /l/-/r/ continuum, while English speakers do. These differences across languages reflect the fact that whereas in one language, a difference in two sounds may not indicate a different word, in another language it might.
The most obvious explanation for why we perceive phonemes categorically is that we simply learn, on the basis of what we hear around us, which differences (in voice onset time, or in frequency, or whatever) should be attended to, and which differences should be ignored. It is so obvious, that it came as quite a surprise to discover, as Peter Eimas and colleagues did in the early 1970s at Brown University, Rhode Island, that the same phenomenon can be found to occur with one-month-old infants. Could infants learn enough about the sounds in their language that they could already, by the time they are just one month old, ignore irrelevant differences? And how could the infant possibly know which differences are relevant if it has not yet acquired knowledge about which are the different words in its language?
The puzzle was all the more intriguing when it was found that infants were sensitive not only to differences relevant to their own language, but to differences relevant to other languages as well. So an English baby would be able to discriminate between sounds relevant to Thai which adult English speakers would not be able to discriminate between. However, the baby would not be able to discriminate between any two sounds; it would still only be able to discriminate sounds that straddled the relevant boundaries. Not surprisingly, many researchers concluded on the basis of these findings that humans are born with some special apparatus that is geared not simply to recognizing only phonemes (and in so doing, ignoring irrelevant differences), but to recognizing any phoneme that could potentially occur in any language.
The ability to discriminate between differences relevant to other languages, but irrelevant to the mother language, is lost after around 10 months. It is at this stage that infants become 'adult' with respect to how they recognize phonemes. Presumably, we learn in those first 10 months to ignore the differences which, although recognizable in principle, are irrelevant and unused in the environment around us.
A uniquely human ability?
Recognizing phonemes is apparently as fundamental and predetermined as recognizing, for instance, movement. In the late 1950s and early 1960s it was established that when the eye sees a moving image, cells in the brain respond to the movement of the object, irrespective of the size or shape or colour of whatever it is that is moving (their response is measured in terms of what is basically electrical activity). This led to the notion that these cells were specialized for detecting motion. A wide range of visual phenomena were subsequently investigated which supported this hypothesis (including certain kinds of visual illusion). The relevance to phoneme detection is that various phenomena concerning the perception of phonemes were discovered that were analogous to these visual phenomena. And because, in the visual system, these phenomena were taken as indicating that we have specialized motion detectors, the equivalent phoneme perception data were taken as indicating that we have specialized phoneme detectors, that respond to certain sounds and not to others in much the same way as the different cells in the retina respond to different wavelengths of light. And because speech is a uniquely human characteristic (other animals communicate vocally, but they do not do so with human speech sounds), it was supposed that evolution had somehow equipped humans with a perceptual apparatus that was specifically pretuned to human speech.
It soon transpired, however, that this view of a pretuned, preprogrammed perceptual mechanism was unlikely to be right. First, certain non-speech sounds can be perceived categorically (for example, musical tones), so categorical perception is not limited to speech sounds. However, the fact that non-speech sounds are perceived categorically may simply mean that we use whatever apparatus we have for perceiving speech for perceiving other kinds of sound as well. Also, and perhaps more importantly, not all phonemes are perceived categorically; vowels, for instance, tend to be perceived non-categorically, and slight differences between two versions of the same vowel can be perceived. Finally, in 1975 at the Central Institute for the Deaf in St. Louis, Missouri, Pat Kuhl and Jim Miller found that chinchillas do it too.
Like many animals, chinchillas can be trained to move from one side of a box to another when they hear one sound (the target sound) but not when they hear any other sound. If the animal moves to the other side when a sound similar, but not identical, to the target sound is played, one can assume that the two sounds sounded the same to the chinchilla. If it stays put, they must have sounded different.
The finding that what is essentially a rodent can perceive phonemes categorically was a severe setback to any theory which supposed that categorical perception was a uniquely human ability resulting from the existence of phoneme-specific detectors. It did not help that certain kinds of monkey, and even certain kinds of bird, also perceive phonemes categorically! Subsequent analysis of what was going on suggested that the reason that within-category sounds (such as two different versions of /b/) could not be discriminated was not due to specialized detectors (chinchillas are unlikely to have phoneme detectors), but was instead due to some particular property of the auditory system, shared by other non-human species.
The human ear, in fact any ear, is not a perfect channel for sound; what gets through is actually a distorted version of the original signal, and the signals that the brain receives reflect that distortion (in part due to the ear's physical properties, and in part due to distortions introduced within the nervous system that connects to the ear). Consequently, the inability to detect certain differences may reflect these distortions (the two different signals are distorted to the point of becoming the same). What may then have happened is that speech evolved to take advantage of differences that could be perceived. The chinchilla does not distinguish between certain sounds because they happen to be phonemes, but because they happen to sound different. As a species, we have learned to ensure that if we want two sounds to be perceived as different (for the purposes, perhaps, of distinguishing between different words), we should ensure that we use just those sounds, or a subset of them, that can be distinguished.
Unfortunately, even this account cannot quite explain everything. There is a substantial fly in the ointment which places a complete, water-tight theory slightly out of reach.
Fast talking
The boundaries that we find in categorical perception should be relatively fixed if they reflect properties of the 'hardware'. And yet Quentin Summerfield, at Queen's University in Belfast, demonstrated that the boundaries that separate one phoneme from another are not fixed. His basic finding is relatively simple. Recall that a phoneme on the /ba/-/pa/ continuum with a voice onset time of 40 milliseconds is perceived as [pa], and with a voice onset time of 20 milliseconds it is perceived as a [ba]. It was found that, if these sounds were heard in the context of speech uttered at different rates, then the boundary would shift. For instance, at a very fast speaking rate, something with a short voice onset time that should be perceived as a [ba] is in fact perceived as a [pa]. The perceptual system 'knows', somehow, that everything is speeded up, and that what should normally be a longish time interval will now be a shorter interval. So a shorter interval is treated as if it were a longer one, and consequently a shorter voice onset time is treated as if it were a longer voice onset time. Similar effects have been found using other continua, with stimuli that vary not in terms of voice onset time (as on the /ba/-/pa/ continuum) but in terms of other subtle differences in timing (between the start of one frequency component and another).
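One way to picture a boundary that moves with speaking rate is the toy sketch below. It is an illustration under loud assumptions rather than a description of Summerfield's analysis: the boundary at a normal rate is set at 30 milliseconds (an illustrative midpoint between the values of around 20 and 40 milliseconds used earlier), and it is assumed to shrink in simple proportion to a hypothetical rate_factor, where 1.0 stands for a normal rate and 2.0 for speech twice as fast.

```python
# Toy sketch of a category boundary that shifts with speaking rate.
# ASSUMPTIONS: the 30 ms boundary at a normal rate and the simple inverse
# scaling with rate are illustrative choices, not measured values.
NORMAL_BOUNDARY_MS = 30.0


def effective_boundary(rate_factor: float) -> float:
    """Boundary in ms for a given rate (1.0 = normal, 2.0 = twice as fast)."""
    if rate_factor <= 0:
        raise ValueError("rate_factor must be positive")
    return NORMAL_BOUNDARY_MS / rate_factor


def perceive(vot_ms: float, rate_factor: float = 1.0) -> str:
    """Return 'ba' or 'pa' for a voice onset time heard at a given speaking rate."""
    return "ba" if vot_ms < effective_boundary(rate_factor) else "pa"


# The same 20 ms voice onset time, heard at two different speaking rates.
print(perceive(20.0, rate_factor=1.0))  # ba
print(perceive(20.0, rate_factor=2.0))  # pa
```

At the faster rate the boundary in this sketch falls to 15 milliseconds, so a voice onset time that counts as short in normal speech is treated as long.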
The Ascent of Babel: An Exploration of Language, Mind, and Understanding Page 4