Only one extra detail needs to be added to allow this to happen: when a pattern is input to the network across the letter neurons, something has to be able to tell the network what pattern of activation it should output across the phoneme neurons. What happens is that the computer looks at the activation value of each of the output (phoneme) neurons and compares it with what it should have been if the right activation pattern had been produced. It then modifies the strength of each connection leading to that neuron by a very small amount that depends on how far the neuron was from its correct value. It does this for each connection leading to each of the output neurons. It then does the same for each connection leading to each of the intermediary neurons: it works out (in effect) what the activation pattern across these neurons should have been in order to produce something a little closer to the correct activation at the output neurons, compares how each of these neurons did with how it should have done, and modifies the strength of each connection leading to each neuron, again only by a very small amount. It sounds complicated, but it requires only some fairly simple mathematics (which computers are good at). This whole process is repeated for each pairing of input and output patterns presented to the network, with each presentation causing the connection strengths to change very slightly. Eventually, as the network learns those input-output pairings, the changes to the connection strengths become ever more slight, until they stop altogether once the network gets each pairing right.
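For readers who like to see the mechanics, here is a minimal sketch of that procedure in Python. The sigmoid neurons, the layer sizes, and the learning rate are illustrative assumptions of mine, not details taken from the original simulations:

```python
import numpy as np

rng = np.random.default_rng(0)

n_letters, n_hidden, n_phonemes = 26, 10, 40      # assumed sizes
W1 = rng.normal(0, 0.1, (n_hidden, n_letters))    # letter -> intermediary strengths
W2 = rng.normal(0, 0.1, (n_phonemes, n_hidden))   # intermediary -> phoneme strengths

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(letters, phonemes, rate=0.1):
    """One presentation of an input-output pairing."""
    global W1, W2
    hidden = sigmoid(W1 @ letters)     # activation spreads to the intermediary neurons
    output = sigmoid(W2 @ hidden)      # ...and on to the phoneme neurons
    # How far was each output neuron from what it should have been?
    err_out = (phonemes - output) * output * (1 - output)
    # Work out (in effect) how the intermediary neurons should have behaved.
    err_hid = (W2.T @ err_out) * hidden * (1 - hidden)
    # Nudge every connection strength by a very small amount.
    W2 += rate * np.outer(err_out, hidden)
    W1 += rate * np.outer(err_hid, letters)
    return np.abs(phonemes - output).mean()        # how close the network got

# Presenting the same (made-up) pairing over and over shrinks the discrepancy.
letters  = rng.integers(0, 2, n_letters).astype(float)
phonemes = rng.integers(0, 2, n_phonemes).astype(float)
for _ in range(2000):
    discrepancy = train_pair(letters, phonemes)
print(discrepancy)   # small and still shrinking: the network is learning the pairing
```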
David Rumelhart, Geoffrey Hinton, and Ronald Williams at the University of California, San Diego, first described this learning procedure (now known as back-propagation) in the mid-1980s. Although it is very unlikely that real brains systematically modify the sensitivity of their neural connections in exactly the same way, we do know that those sensitivities do change (to the point where the connections may even disappear). The principles, if not the methods, are the same.
Coping with the sequential structure of language
The problem with language is that things happen one after the other. The neural network that we have been looking at is severely limited because, although it can learn to associate one thing with another, the things it learns about are static, unchanging patterns. What is needed is a network which can take a single pattern across one set of neurons (representing, for instance, the visual image of a word) and associate that with a sequence of patterns, one after the other, across another set (representing the sequence of phonemes that need to be uttered, one after the other, to say that word). Or, better still, a network which can take a sequence of patterns across one set of neurons (representing, perhaps, the incoming speech signal) and generate another sequence of patterns across another set of neurons (representing, perhaps, the developing meaning of that signal).
Our example network has a further limitation that is in some respects even more serious. In order to learn anything, it needs the equivalent of a teacher who can tell it that the current activation pattern is not quite right, and can tell it the correct pattern that it should be aiming for. But except for when we are explicitly taught to read, and are explicitly taught such things as letter-to-sound correspondences, nobody teaches us what the correct activation patterns should be in response to what we hear. Could we design a network which did not rely on an explicit teacher?
In the late 1980s, Jeffrey Elman, working at the University of California, San Diego, developed a neural network that addressed both these drawbacks. He borrowed an idea that had previously been developed by a colleague of his, Michael Jordan. Jordan had demonstrated that a very simple extension to our example network could learn to output a sequence of things, one thing after another, when given just a single input pattern. The network acted as if it had a queue, or buffer, containing activation patterns waiting to be output, with something directing which pattern should come next (see Chapter 10 for the application of queues in language production). Jordan's extension gave the network the equivalent of a memory for what it had output so far. Fortunately, his new network could use the same learning procedure as before. Elman extended Jordan's technique so that the network would also be able to take as input a sequence of things. Better still, he got rid of the need for any explicit teaching. This is how the 'Elman net' works:
Imagine that at the first tick of a clock an activation pattern spreads from the input neurons to the intermediary neurons. At the second tick it spreads from the intermediary neurons to the output neurons. On the third tick, a new input (the next element in the sequence) feeds through from the input neurons to the intermediary neurons, and so on. Elman added an extra step. At the first tick, as before, activation would spread from the input neurons to the intermediary neurons. At the second tick, two things would happen. Activation would still spread from the intermediary neurons to the output neurons. But it would also spread to a new set of neurons (copy neurons) that were wired up so that they would duplicate the activation pattern across the intermediary ones: each copy neuron received activation from just one intermediary neuron, and a connection strength of one ensured that the copy neuron would take on the activation value of the intermediary neuron it was connected to. Each copy neuron, in turn, had connections back to each of the intermediary neurons. So on the third tick, the intermediary neurons would receive both new activation from the input neurons and a copy of the previous pattern of activation across those intermediary neurons. The pattern of activation across the intermediary neurons would therefore embody both the network's reaction to the new input and its reaction to the previous input. And of course, that earlier reaction was itself a reflection of the reaction before that, and before that. The network therefore had a memory of how it had reacted to previous inputs, and of how those inputs were sequenced through time. And because it had a memory of its previous reactions, each output could be determined, in part, by that memory.
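Here is a minimal sketch of that extra step, again in illustrative Python with assumed sizes and randomly initialized connection strengths; the only addition to the earlier sketch is the copy of the intermediary pattern, carried over from one tick to the next:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 20, 15                           # assumed sizes
W_in  = rng.normal(0, 0.1, (n_hid, n_in))      # input -> intermediary
W_cpy = rng.normal(0, 0.1, (n_hid, n_hid))     # copy neurons -> intermediary
W_out = rng.normal(0, 0.1, (n_in, n_hid))      # intermediary -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(x, copy):
    """One tick: new input plus the copy of the previous intermediary pattern."""
    hidden = sigmoid(W_in @ x + W_cpy @ copy)  # reacts to the input AND to its own past
    output = sigmoid(W_out @ hidden)
    return output, hidden                      # hidden becomes the copy for the next tick

sequence = [rng.random(n_in) for _ in range(5)]   # a made-up sequence of input patterns
copy = np.zeros(n_hid)                            # no memory before the first tick
for x in sequence:
    output, copy = step(x, copy)
```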
Because the Elman net has a memory for how it has reacted to previous input, it can either take a sequence of things in one order, and output one thing, or take those same things in a different order, and output something else (whether a static pattern or a sequence of changing patterns). We shall see an example of what the net was capable of in the next section. But a final innovation was Elman's realization that his network could be taught without an explicit teacher. He designed his network so that it would predict what its next input would be.
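In the same illustrative Python (reusing the weights and sigmoid function from the previous sketch), that realization needs no new machinery: the target at each tick is simply whatever arrives at the input neurons on the next tick.

```python
def train_on_sequence(sequence, rate=0.1):
    """Self-teaching by prediction: the next input serves as the correct output."""
    global W_in, W_cpy, W_out
    copy = np.zeros(n_hid)
    for current, nxt in zip(sequence, sequence[1:]):
        hidden    = sigmoid(W_in @ current + W_cpy @ copy)
        predicted = sigmoid(W_out @ hidden)
        # The discrepancy between the prediction and the next real input
        # drives the usual very slight changes to the connection strengths.
        err_out = (nxt - predicted) * predicted * (1 - predicted)
        err_hid = (W_out.T @ err_out) * hidden * (1 - hidden)
        W_out += rate * np.outer(err_out, hidden)
        W_in  += rate * np.outer(err_hid, current)
        W_cpy += rate * np.outer(err_hid, copy)
        copy = hidden
```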
Neural networks and the prediction task
Whereas our earlier network could learn to associate one pattern with another, the Elman net can learn to associate a changing sequence of patterns with another changing sequence of patterns. The prediction task is just a variant on the idea that one sequence can be associated with another. Each element in an input sequence causes the network to output something. That output is compared with the next element of the input sequence, and the connection strengths are modified according to the discrepancy. In principle, then, the Elman net can be trained to predict, on the basis of what it has seen so far of an input sequence, what the next element in that sequence is likely to be. But how useful is this? Surely we are not exposed to sequences of words, for instance, which allow us to predict with any certainty what the next element is going to be? In fact, we are. Here is what Elman did.
Using a limited vocabulary, Elman generated around 10 000 very short sentences. Each word, like each letter in our earlier example network, was allocated its own unique activation pattern. So a sentence, when presented to the network, would cause a sequence of activation patterns across the input neurons that corresponded to the sequence of words making up the sentence. These would spread through the network and cause a sequence of activation patterns across the output neurons. The network's task was to predict, after each word in the sentence, what the next word was going to be. Each output pattern would be compared against the pattern allocated to the next word in the input sequence, and the connection strengths changed (ever so slightly) according to how different the two patterns were. This was repeated many times for each of the 10 000 sentences. Not surprisingly, the network never managed to predict the next word with any great success: something like 'the boy' could be followed by any number of different verbs, for instance, and the network could not tell which. So why all the excitement?
Elman knew full well that the network could not be expected to predict with any accuracy the next word in any sequence. But after repeated exposure to the many different sentences, it did none the less learn to output, at each step through the sequence, a complex pattern that represented the variety of different words that might occur next. If, for example, the boy ate a sandwich in one sentence, and a cake in another, the network would predict, after 'The boy ate', that the next word would be 'sandwich' or 'cake'. It would do this by outputting the pattern of each one of these words simultaneously, superimposed on one another to form a composite pattern.
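A toy calculation (my own, with made-up two-unit word patterns, not anything from Elman's simulation) shows why the very small nudges settle on precisely such a composite. A single prediction vector is pulled towards 'sandwich' on half the presentations and towards 'cake' on the other half:

```python
import numpy as np

sandwich = np.array([1.0, 0.0])   # toy one-unit-per-word patterns
cake     = np.array([0.0, 1.0])

prediction = np.zeros(2)
for i in range(10000):
    target = sandwich if i % 2 == 0 else cake
    prediction += 0.01 * (target - prediction)   # the usual very small nudge

print(prediction)   # roughly [0.5, 0.5]: both words superimposed at half strength
```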
Elman was also interested in something else the network did. If a pattern of activation across the input neurons represented a particular word, what did the patterns of activation that developed across the intermediary neurons represent? With successive exposure to different sequences (or the same sequence, come to that), the learning procedure changed all the connection strengths linking the input neurons to the intermediary ones. The same input pattern would therefore lead to different intermediary patterns as the learning progressed. So different patterns evolved during (and as a consequence of) that learning. If activation patterns can be equated with representations, that means that the network had evolved its own, internal, representations. But of what?
In order to answer this question, Elman waited until the network had reached a point where the changes were very slight indeed and, in effect, there was nothing more it could learn from successive exposure to the sentences it was being trained on; it was not getting any better. He then used a statistical procedure (a cluster analysis) to analyse all the intermediary activation patterns that were produced in response to each input word. This allowed him to see which words caused similar patterns of activation across those intermediary neurons, and which caused different patterns. He found that all the nouns produced similar activation patterns, and the verbs did so too, but the two sets of patterns were quite distinct. Crucially, the patterns input to the network across the input neurons were just arbitrary patterns. Some might have been similar through chance, but others were quite different. But this did not stop the network from learning to distinguish between the noun patterns and the verb patterns.
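The shape of that analysis can be approximated with an off-the-shelf hierarchical clustering routine. In the sketch below the activation patterns are random placeholders and the SciPy routine is my choice; Elman's actual procedure was equivalent in spirit rather than in code:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

words = ['boy', 'lion', 'monster', 'sandwich', 'plate', 'chase', 'eat', 'sleep']
# One row per word: the average intermediary activation pattern that word
# evoked after training (random stand-ins here).
activations = np.random.default_rng(2).random((len(words), 15))

tree = linkage(activations, method='average')   # group similar patterns together
dendrogram(tree, labels=words)                  # similar words share branches
plt.show()
```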
In addition to being able to distinguish between nouns and verbs, the network also learned to distinguish between transitive and intransitive verbs ('chase' vs. 'sleep'), and between animate nouns ('boy', 'lion', 'monster') and inanimate nouns ('sandwich', 'plate', 'glass'). It also distinguished between animals and humans, and between things that were edible and things that were not. In each case, words within one category would cause patterns of activation across the intermediary neurons which were similar, but which would be quite different from the patterns caused by words from another category. How could any of this come about?
The only information available to the network was in the form of activation patterns across its input neurons. For each word (that is, for each activation pattern), it had information about what had come beforehand in the sequence, and what had come after. And that is exactly the information that distinguishes nouns from verbs: they occur in different contexts. For instance, in the sentences that Elman used, verbs would be preceded by nouns, and nouns sometimes by verbs, but never by nouns. Similarly, certain nouns would occur in the context of certain verbs but not in the context of certain other verbs: inanimate nouns could only occur before certain kinds of verb, edible nouns after certain kinds of verb, and so on. In fact, all the distinctions that the network made were based solely on the fact that different kinds of word occurred in different kinds of context. The network's memory meant that it could 'spot' that certain kinds of word occurred in certain similar kinds of context, whereas certain other kinds of word occurred in different kinds of context.
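A toy count, of my own construction rather than anything in Elman's work, makes the point. Tabulate what comes immediately before and after each word in a handful of sentences, and the nouns end up with one kind of context profile, the verbs with another:

```python
from collections import Counter

sentences = [['boy', 'chases', 'dog'], ['girl', 'eats', 'cake'],
             ['dog', 'chases', 'boy'], ['boy', 'eats', 'sandwich']]

contexts = {}
for s in sentences:
    for i, word in enumerate(s):
        c = contexts.setdefault(word, Counter())
        if i > 0:
            c['before:' + s[i - 1]] += 1      # what preceded this word
        if i < len(s) - 1:
            c['after:' + s[i + 1]] += 1       # what followed it

print(contexts['boy'])     # flanked by verbs, like the other nouns
print(contexts['chases'])  # flanked by nouns, like the other verb
```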
This still fails to explain how the network spotted anything at all. It would need some mechanism that would cause it to form representations that were defined by the contexts. This is where the prediction task comes in. The first step in the argument is that the network did learn to predict which words could come next. And because the different words that can occur in the same position within a sentence must have the same syntactic category (e.g. noun, verb), the output patterns would necessarily come to reflect exactly those categories, with finer distinctions being made for subcategories that appeared in some contexts but not others; hence the distinction between animates and inanimates, transitives and intransitives, and so on. So the output neurons reflected the syntactic category of the next word in the input. But for the right category to be predicted, something must also have reflected the syntactic categories that had come earlier in the sequence. That is what the intermediary neurons did. In Elman's sequences, the best predictor of the next word was the immediately preceding word (in fact, the current word being 'seen' by the network), so the most obvious characteristic that was encoded by those neurons was the syntactic category of that word. In effect, this reflected the range of words that could occur in that position in the sentence.
The general principle at work here is that the intermediary neurons encode whatever property of the input sequences allows the correct (or best) predictions to be made at the output neurons. This happens because the connection strengths within the network change as a function of how good the prediction has been. If the intermediary neurons manage to encode a property of the input sequences that is highly predictive of the correct output, the strengths will be changed only very slightly, if at all. But if the intermediary neurons have failed to encode any property that is predictive of the correct output, the strengths will be changed quite substantially, across the many exposures that the network receives. And because the network's memory is encoded in those connection strengths, anything that is not predictive will, in effect, be forgotten.
So the Elman net does not use its memory to store a faithful reproduction of everything that it has ever encountered. If 'sandwich' and 'cake' had occurred in exactly the same contexts, they would have given rise to the same internal representations (that is, patterns of activation across its intermediary neurons). But in real language, different words tend to occur in different contexts, and Elman's simulations attempted to capture this. 'Sandwich' and 'cake' occurred in subtly different contexts and gave rise to subtly different representations, but whatever could be predicted by 'sandwich' that was the same as whatever could be predicted by 'cake' was encoded, by the network, in that part of the representation that was common to both words. This explains why all the words of the same syntactic category evoked, in Elman's network, similar activation patterns across the intermediary neurons: the overlap between the individually distinct patterns conveyed information that applied to each word in that category, or, in other words, the appropriate generalizations.
One final property of these networks: if the network sees a word like 'the', it can predict that the next word will be a noun (e.g. 'cake') or an adjective (e.g. 'big'). So a composite pattern will be output that reflects both these possibilities. However, if, in the network's experience, it is more common for a noun to follow words like 'the' than it is for an adjective to follow them, the pattern will reflect the difference in the frequency of occurrence. This is a straightforward consequence of the manner in which the connection strengths are changed slightly each time the network encounters each word. If there are lots of nouns in that position, they will pull the strengths in one direction. If there are lots of adjectives, they will pull in another. The final balance depends simply on how many pulls there are in each direction. So, not only does the output of the network reflect the range of predictions that are possible at each point in the sequence, it also reflects the likelihood of each of those predictions.
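The same toy calculation as before, with the frequencies made unequal, shows the pulls balancing out in proportion (again an illustration of the principle, not of Elman's simulation):

```python
import numpy as np

noun, adjective = np.array([1.0, 0.0]), np.array([0.0, 1.0])

prediction = np.zeros(2)
for i in range(50000):
    target = noun if i % 5 != 0 else adjective   # nouns follow 'the' four times as often
    prediction += 0.01 * (target - prediction)   # the same very small nudge

print(prediction)   # roughly [0.8, 0.2]: the composite reflects frequency of occurrence
```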
That, briefly, is the Elman net. We do not know whether the real neural networks operating in our brains do the same kinds of thing. Probably, they do not. But in all likelihood, some of the principles are the same. The remainder of this chapter will look at how networks exhibiting the same properties as the Elman net might explain some of the phenomena that the preceding chapters have introduced. Much of what follows is conjecture, but it is conjecture based on observable fact.
On the meaning of meaning
In the Elman net, the information that was stored about each word (whether it was specific information or general information shared with similar words) was about the contexts in which each word could occur. 'Cake' appeared after verbs like 'eat', not 'chase'. 'Dog' followed verbs like 'chase', but not verbs like 'eat'. Yet they both followed verbs. With this limited information, the network made subtle distinctions between edible things, inedible things, animals, and humans. It made distinctions that we might normally think of as having something to do with the words' meanings.
In Chapter 9, the meaning of something was defined as the knowledge about the contexts in which that something could occur. By this criterion the Elman net had acquired an element of meaning. It was very limited, because the only context available to it, on the basis of which it could distinguish between different words, was the linguistic context. But imagine that the network could receive information more generally about the contexts in which those words would ordinarily be used. The network might have neurons that received information from a retina, or from an ear. The network would not know that these different inputs corresponded to different kinds of information, just as it did not know that the input in Elman's original simulations reflected words in a language (and just as our own neurons do not know what their input reflects). But the network still did a good job of categorizing the words in ways which, as defined by that language, were meaningful. With additional inputs, reflecting other aspects of the contexts in which those words would ordinarily be experienced, the network ought to be able to do an even better job. The nature of the prediction task means that only those aspects of the context that are predictive of the current word would be encoded.