The Ascent of Babel: An Exploration of Language, Mind, and Understanding
One of the puzzles in Chapter 4 was to explain how a child would know which aspects of the context to associate with the sounds he or she heard. That puzzle is effectively solved if the only aspects selected are those that are predictive of, or predicted by, those sounds. And this, according to the descriptions given in Chapter 9, is exactly what is required in order to capture the meaning of something. In fact, meaning is nothing more than that very encoding. In principle, then, even an artificial neural network could achieve such an encoding; it could achieve meaning.
Who did what, and to whom
All this talk of neural activation, the encoding of experience, and prediction, is a far cry from the earlier talk (in Chapter 8) of participants, roles, and the assignment of one to the other. On the face of it, it looks as if we have ended up with an analysis of how we derive meaning that is quite different from that earlier role-assignment approach. In fact, we have simply ended up with a different vocabulary for describing the same process.
One of the puzzling properties of the way in which we go about assigning roles (to use that vocabulary) is that we apparently assign them without waiting for the grammatical information that would unambiguously signal which assignments should be made. We assume, in the sequence `The woman that Bertie presented the wedding ring ...', that the woman is being given the wedding ring even before we encounter the grammatical information, later on in the sentence, that would tell us whether this was right. It need not be. The sentence could be `The woman that Bertie presented the wedding ring to his fiancee in front of was his cousin'. Yes, this is a difficult sentence, but if we blindly obeyed the principles of grammar, we should not assign any role to the woman until we reached the gap between `of' and `was'. It looks from the evidence (see Chapter 8) as if there is some sort of need to allocate each participant a role as soon as one becomes available. We are even prepared to make preliminary role assignments which must subsequently be revised. Why? This is where the more recent talk of neural encoding and prediction comes in.
When we encounter a sentence like `A balding linguist ate a very large fish', our experience of similar linguistic contexts ('an X Y'd') correlates with our experience of X doing the Y'ing (as opposed to X being Y'd). When the verb `ate' is encountered in this sentence, a pattern of neural activity ensues which reflects this experience. And in so doing, it reflects, in effect, the assignment of the `eater' role to the linguist. When `a very large fish' is encountered, the ensuing pattern of neural activity reflects the assignment of the `being eaten' role to the fish. So that is how the neural equivalent of role-assignment works. But each pattern of neural activity also constitutes a prediction of what the successive patterns will be. In a world in which linguists ate fish 75% of the time, and spaghetti the remaining 25%, the patterns of neural activity after `ate' would reflect these differences. They would, in effect, predict one assignment as being more likely than the other, before the sentence unambiguously signalled which was the correct assignment. Of course, we do not live in such a world, but this example demonstrates that if, after `ate', there was a strong likelihood that one thing, rather than another, would fill the being-eaten role, then this would be reflected in the pattern of neural activity at that point.
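The arithmetic behind such graded predictions is simple enough to sketch in a few lines of Python. The frequencies below are the invented ones from the imaginary world just described; the point is only that a prediction derived from experience is nothing more than a record of relative frequency.

```python
from collections import Counter

# Hypothetical experience: in the imagined world, linguists ate fish
# 75% of the time and spaghetti the remaining 25%.
experience = ["fish"] * 75 + ["spaghetti"] * 25

counts = Counter(experience)
total = sum(counts.values())

# After hearing "ate", the prediction for the filler of the
# being-eaten role simply mirrors those past frequencies.
prediction = {word: count / total for word, count in counts.items()}

print(prediction["fish"])       # 0.75
print(prediction["spaghetti"])  # 0.25
```

In a network, of course, these numbers would not be stored as an explicit table; they would be implicit in the strengths of the connections, and hence in the pattern of activity evoked after `ate'.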
In sequences of the form `The X that ...' (as in `The woman that ...'), the X will be assigned a role by something later on in the sentence, in the relative clause. The X could in principle fill any of the roles that become available when the next verb is encountered. And because the predictions that are made at each step in a sentence reflect what is possible given our experience, it follows that the neural activity evoked by the sequence `The woman that Bertie presented the wedding ring ...' will reflect the possibility that the woman is the recipient of the wedding ring. And because, in our experience, the thing that occupies the same position as `the woman' in this sentence is almost always assigned one of the roles associated with the next verb, the possibility that the woman is the recipient in this case is very strong. The neural activity would reflect the strength of this possibility. It would reflect, in effect, that particular role-assignment, before the point in the sentence that would unambiguously signal that this was the correct role assignment. The relationship between meaning, prediction, and experience makes such `early' role assignments an inevitability.
Time flew like an arrow
The link between prediction and meaning ensures that only certain aspects of our experience (or a network's) become encoded as the meaning of something. It also ensures that when a particular combination of contextual factors is encountered, certain predictions will be more likely, and so more influential, than others. This has consequences for the way in which ambiguities are resolved. An example from Chapter 7 involved eating pizza with your friends, your fingers, your favourite topping, your favourite wine, or your customary enthusiasm. The image that is conjured up by hearing that your friend ate pizza with his favourite film star probably does not involve that film star being poured into a glass, being used to cut through the pizza, or being sprinkled on top. We are usually quite unaware of these other possibilities. Why? Because past experience prevents the corresponding predictions from being made.
Chapter 7 ended with the observation that many factors can influence our interpretation of ambiguous sentences. Sometimes it might be the plausibility of the role-assignments. At other times it might be the frequency of occurrence, in the language at large, of one kind of interpretation rather than another (or perhaps, of one kind of grammatical structure rather than another). At other times it might be the fit with the context. Each of these different factors will cause the network (real or artificial) to predict some aspect of its future input.
These factors are influential only insofar as they are predictive. Some factors may be more predictive than others. For example, the frequency of occurrence of a particular syntactic sequence in the language at large may be much more predictive of what will happen next than any other factor. But on occasion, and depending on the context, some other factor may be more predictive. The patterns of activation that encode these predictions will be superimposed one on the other, and depending on the precise circumstances (the preceding input) one pattern may dominate. What counts is not the kind of information that constitutes each factor, but simply how predictive it is.
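A rough sketch, in Python, of what such superimposition might amount to. The factors, readings, and weights below are all invented for illustration, and a real network would blend activation patterns rather than explicit probability tables; but the sketch shows how the more predictive factor comes to dominate.

```python
# Each factor contributes its own prediction (a probability
# distribution over possible continuations). The factors are
# superimposed, weighted by how predictive each one is in the
# current context. All numbers here are invented.

def superimpose(predictions, weights):
    """Weighted blend of several prediction distributions."""
    combined = {}
    for dist, weight in zip(predictions, weights):
        for outcome, p in dist.items():
            combined[outcome] = combined.get(outcome, 0.0) + weight * p
    return combined

# A frequency-based factor strongly favours one reading; a
# contextual-fit factor mildly favours the other.
frequency_factor = {"reading A": 0.9, "reading B": 0.1}
context_factor = {"reading A": 0.4, "reading B": 0.6}

# In this context, frequency happens to be the more predictive factor,
# so it is given the greater weight, and reading A dominates the blend.
blended = superimpose([frequency_factor, context_factor], [0.7, 0.3])
print(blended)  # {'reading A': 0.75, 'reading B': 0.25}
```

What the sketch cannot show is that, in a network, the weighting is not set by hand; it emerges from how reliably each factor has predicted the input in the past.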
Words and how we found them
An Elman-like network with sufficiently rich input could derive meaning from sequences of words. At least, that is the conjecture. But why should the linguistic input to this hypothetical network be confined just to sequences of words? As far as the original Elman net was concerned, it was simply experiencing different activation patterns across its input neurons. It could not know what those patterns correlated with in the world beyond its neurons. In Elman's original experiments, the activation patterns across the input neurons did represent whole words. But, although we perceive what we hear as sequences of words, each word is itself a sequence of phonemes, and phonemes can themselves be broken down into subtly different acoustic patterns. If an Elman-like net was given these acoustic patterns as input, what would happen?
By trying to predict what will come next, an Elman-like net will learn to encode information that is predictive of the kinds of context in which an input sequence might occur. If it is a sequence of phonemes (or of the acoustic patterns that make up each phoneme), the network will learn to encode information about the contexts in which those sequences might ordinarily occur. It would be able to predict the range of phonemes (or equivalent) that could continue a sequence; in effect, the range of words that are compatible with the sequence of phonemes heard so far. And as the network `heard' progressively more of any one sequence, the number of possible continuations would fall, and the network would progressively activate patterns that reflected more strongly the remaining predictions. But there is more. With the right kinds of input (more than just the linguistic input), and sufficient exposure to the language, the network could in principle learn not simply what the next phoneme was likely to be, but more generally it could learn about the context in which that entire sequence would occur; in effect, the meaning of the word composed of those phonemes. So as more of each word was heard, the network's internal activation patterns would reflect more strongly the meaning of each word that was still compatible with the input. Eventually, when no other continuation was possible, they would reflect just the one meaning.
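The progressive narrowing of candidate words can be sketched as a simple lookup. The toy lexicon and its phoneme spellings below are invented for the example; a real network would represent the shrinking candidate set as graded activation rather than an explicit list.

```python
# A toy lexicon mapping (invented) phoneme sequences to meanings.
lexicon = {
    "kat": "a small domestic feline",
    "kap": "a soft hat",
    "kit": "a set of equipment",
    "dog": "a domestic canine",
}

def candidates(phonemes_so_far):
    """Words (and meanings) still compatible with the input so far."""
    return {word: meaning for word, meaning in lexicon.items()
            if word.startswith(phonemes_so_far)}

# As more of the word is heard, fewer continuations remain.
print(len(candidates("k")))    # 3 candidates
print(len(candidates("ka")))   # 2 candidates
print(len(candidates("kat")))  # 1 -- only one meaning is left
```

The sketch treats compatibility as all-or-none; in the network, each still-compatible meaning would be partially active, with the activation strengthening as the alternatives dropped away.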
The description given in Chapter 6 of this process included the idea that a sequence of phonemes stimulates a neural circuit much like a sequence of numbers opens a mechanical combination lock, with successive tumblers falling into place one number after another. This analogy is in fact inappropriate. The different neural circuits are not physically separable in the same way that different combination locks are. One can think of each word that is input to the network as activating a separate neural circuit, but it is the same neurons each time, just with different activation patterns across them. It is the same connections also.
The process by which we recognize spoken words is complicated, as we saw in Chapter 6, by the fact that the same word will often be pronounced using different phonemes, depending on the surrounding words. So the sequence corresponding to `Hand me that thin book' might on occasion sound more like `hameethathimboo'. One possible solution, mentioned in that chapter, was that we use rules which define the circumstances in which a phoneme of one kind should be interpreted as a phoneme of another. We would then recover the meaning of the re-interpreted sequence; something that sounded like `thim' would be re-interpreted as `thin' if the following phoneme had been a /b/ (as in `book'). A rule of this kind is nothing more than a statement of the contextual conditions in which a particular meaning should be associated with one sequence of phonemes rather than another. This is exactly the kind of thing that networks can learn. If `thim' had been experienced in exactly the same contexts as `thin' had been, except that the following phoneme was a /b/, the network would inevitably activate a pattern across its intermediary neurons that reflected this experience. As long as `thim' was encountered before a /b/, the network would activate a pattern across its intermediary neurons that was, to all intents and purposes, the same as that for `thin'.
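Stated as an explicit rule, the re-interpretation looks like the sketch below (a deliberately simple Python illustration; a network would learn the regularity from experience rather than have it written in as a conditional).

```python
# A contextual rule of the kind described: a sequence sounding like
# "thim" is interpreted as "thin" when the following phoneme is /b/.
# The rule is just a statement of contextual conditions.

def interpret(heard, next_phoneme):
    """Map what was heard onto the intended word, given context."""
    if heard == "thim" and next_phoneme == "b":
        return "thin"
    return heard

print(interpret("thim", "b"))  # "thin" -- as in "thin book"
print(interpret("thim", "s"))  # "thim" -- no rule applies
```

The difference, on the network story, is that nothing like this `if' statement exists anywhere; the same mapping simply falls out of the fact that `thim'-before-/b/ and `thin' were experienced in the same contexts.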
To the extent that linguistic rules are simply generalizations about the contexts in which one can predict one thing to happen or another, a network exhibiting the same properties as an Elman net ought to be able to encode information that is equivalent to those rules. In fact, it is difficult to see how else a set of rules could be encoded within the neural circuitry.
Words and what we learnt to do with them
In this final section, we move away from what an Elman-like net could in principle do, back to what Elman's net actually did.
Many linguists and psycholinguists believed that the acquisition of grammatical knowledge would not be possible if the only input to the learning device was the language itself. The basic problem was that grammatical knowledge was believed to exist as rules about the relative positioning of syntactic categories (things like `noun' and `verb') within the sentences of the language. But if you did not know about syntactic categories, how could you generate these rules? Even if you knew which syntactic categories existed in your language, how would you know which words belonged to which category? The only solution to this problem (so the argument went) was to assume that some of the knowledge that was necessary for learning about grammar was innate (see Chapter 4). Several researchers suggested instead that because those linguistic rules were simply generalizations about which words could occur where in the sentence, all the child needed to do was calculate the equivalent; in other words, calculate the individual positions of each word relative to each other word in the language. The problem, at the time, was how the learning device would avoid learning irrelevant facts. Knowing that a noun might appear four words before a verb, or seven words after `a' or `the', would not be very helpful. In any case, it would be impossible to store in memory every possible fact about every possible position in which every possible word could occur. The Elman net demonstrated a simple solution.
The only information that Elman's network encoded was information that was predictive of the next word. In effect, it did simply calculate the position of each word relative to each other. It kept the information that was predictive, and discarded the information that was not. It encoded the information it kept as a combination of information that was specific to each word, and information that constituted generalizations that applied to whole groups of words. And as we saw earlier in this chapter, by developing those generalizations, the network developed the equivalent of syntactic categories and knowledge about which order they could appear in. It acquired grammar.
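The flavour of this distributional calculation can be sketched in a few lines of Python. The tiny corpus below is invented, and counting shared contexts stands in for the graded generalizations a network would actually develop; but it shows how word categories can emerge from position alone, with no syntactic labels supplied in advance.

```python
from collections import defaultdict

# A tiny invented corpus. Words are grouped purely by the company
# they keep -- which words precede and follow them.
corpus = ("the cat sees the dog . the dog sees the cat . "
          "the cat chases the dog . the dog chases the cat .").split()

# Record, for each word, the (previous word, next word) contexts
# in which it has occurred.
contexts = defaultdict(set)
for i in range(1, len(corpus) - 1):
    contexts[corpus[i]].add((corpus[i - 1], corpus[i + 1]))

def shared_contexts(word1, word2):
    """How many positional contexts two words have in common."""
    return len(contexts[word1] & contexts[word2])

# "cat" and "dog" occur in the same positions, so they fall into the
# same emergent category; "cat" and "sees" share no contexts at all.
print(shared_contexts("cat", "dog") > shared_contexts("cat", "sees"))  # True
```

The nouns cluster with the nouns and the verbs with the verbs, not because anything labelled them as such, but because the positional regularities of the corpus push them together; this is the sense in which the Elman net `acquired grammar'.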
It might appear that there is little these networks cannot do. But an inherent problem with many artificial neural networks is that they can be quite unpredictable. Their mathematical properties do not guarantee that two networks with, for example, different numbers of neurons, will behave in exactly the same way. Much of the previous discussion is therefore limited to speculation until such time as the appropriate networks are built, and trained on the appropriate kinds of input. But even if this happened, and they did all the things we hoped they could do, there would still be a lot they could not do. They would not play. They would not have the same drives or desires as their human counterparts. They would not sit on the kitchen floor and lick the cake mixture off the spoon. They would not interact with their environment. They would not develop physically. Would that matter, though? To the extent that these are all parts of the child's experience, yes. To the extent that artificial neural networks are being built which mimic aspects of neural development, or which can learn to interact with their environment in appropriate ways, perhaps not.
Elman's prediction task is appealing because as tasks go, it has obvious evolutionary advantages-watching a moving object, catching a fly, chasing a beetle, and fleeing a foe all require the organism to predict, from one moment to the next, what is about to happen. But although it has obvious appeal, is it really enough to explain the phenomena that we have been dealing with here? Probably not. Even a frog can track a moving object, catch a fly, chase a beetle, or flee a cat. But can it talk? Does it understand? True, it does not grow up in the same environment that we do, but attempts to bring up even chimpanzees in the home environment have failed to produce a human chimpanzee. Their language is at best limited, and there is considerable controversy surrounding the claim that such chimpanzees can acquire even the smallest rudiments of grammatical knowledge. Perhaps what differs across the species is the sophistication and subtlety of prediction that each species is capable of. At the least, the difference must have something to do with differences in the sophistication, subtlety, and development, of their neural networks. It is therefore instructive to consider the principles, predictive or otherwise, that might be shared between natural and artificial neural networks. To echo a by-now familiar theme, the methods may well be different, but the principles may well be similar. And if we understand those principles, we necessarily understand better whatever lies at Babel's summit.
The descent from Babel
Depending on which books you read, language evolved as a form of social grooming (much like apes picking at each other's nits), or as a form of social cooperation. Perhaps it evolved as an extension of the animal calls that other species exhibit, or as an extension of the various gestures that we (and other animal species) are capable of. No one can be sure why language evolved. There is something compelling about the argument that language evolved because a people with even a rudimentary language, and the ability to organize themselves socially through that language, would be better able to defend themselves, hunt, share out the food, reproduce, and so on. If this were right, the evolution of language would be little more mysterious than the evolution of hunting in packs, or migrating in groups, or living in family units. They each involve essentially social activities that improve the survival of the species. And the only reason monkeys did not evolve language of the complexity and richness of our own is that they did not need to. They adapted to their environment in a fundamentally different way than Homo habilis did (so-called because of his use of tools, around 2 million years ago). Being adapted physically to the environment was not the only way in which the survival of a species could be ensured. Had we not been able to organize ourselves socially (and linguistically), we would not have progressed, in evolutionary terms, so remarkably. By the time modern man came along (about 100 000 years ago), language was already, perhaps, quite firmly established.
Evolving languages
It is generally accepted that languages, like the different races of humans, evolved from a common origin. The first hint that this may be the case arose in the eighteenth century when an Englishman, Sir William Jones, was appointed as a judge in Calcutta. Much Hindu law was written in Sanskrit which, at that time, was no longer spoken, except as part of religious ceremonial. As Jones learnt more about it, he started to notice similarities between words in Sanskrit, Greek, Latin, German, Italian, Spanish, French, Czech, Icelandic, Welsh, English, and several other of what are now known as the Indo-European languages. Example words from these languages are `mater', `meter', `mater', `Mutter', `madre', `madre', `mère', `matka', `móðir', and `mam', which all mean `mother', and `nakt', `nux', `nox', `Nacht', `notte', `noche', `nuit', `noc', `nótt', and `nos', which all mean `night'.