by Martin Ford
In 1981, I was a postdoc in San Diego, California, and David Rumelhart came up with the basic idea of backpropagation, so it’s his invention. Ronald Williams and I worked with him on formulating it properly. We got it working, but we didn’t do anything particularly impressive with it, and we didn’t publish anything. After that, I went off to Carnegie Mellon and worked on the Boltzmann machine, which I think of as a much more interesting idea, even though it doesn’t work as well. Then in 1984, I went back and tried backpropagation again so I could compare it with the Boltzmann machine, and discovered it actually worked much better, so I started communicating with David Rumelhart again.
What got me really excited about backpropagation was what I called the family trees task, where you could show that backpropagation can learn distributed representations. I had been interested in the brain having distributed representations since high school, and finally, we had an efficient way to learn them! If you gave it a problem such as inputting two words and having it output the third word that goes with them, it would learn distributed representations for the words, and those distributed representations would capture the meanings of the words.
Back in the mid-1980s, when computers were very slow, I used a simple example where you would have a family tree, and I would tell you about relationships within that family tree. I would tell you things like Charlotte’s mother is Victoria, so I would say Charlotte and mother, and the correct answer is Victoria. I would also say Charlotte and father, and the correct answer is James. Once I’ve said those two things, because it’s a very regular family tree with no divorces, you could use conventional AI to infer, using your knowledge of family relations, that Victoria must be the spouse of James because Victoria is Charlotte’s mother and James is Charlotte’s father. The neural net could infer that too, but it didn’t do it by using rules of inference; it did it by learning a bunch of features for each person. Victoria and Charlotte would each be represented by a bunch of features, and interactions between those feature vectors would cause the output to be the features for the correct person. From the features for Charlotte and the features for mother, it could derive the features for Victoria, and when you trained it, it would learn to do that. The most exciting thing was that for these different words, it would learn these feature vectors: it was learning distributed representations of words.
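To make that concrete, here is a minimal sketch, in PyTorch, of the kind of network being described: each person and each relationship gets a small learned feature vector, and backpropagation adjusts all of those features so that a (person, relationship) pair produces the correct person as output. The layer sizes, the tiny dataset, and the names below are illustrative assumptions, not the details of the original 1986 network.

```python
# Toy version of the family-trees task: learn feature vectors for people and
# relationships such that (person, relation) -> correct person. Sizes and the
# two-example dataset are invented for illustration; only the overall idea
# (learning distributed representations by backpropagation) follows the text.
import torch
import torch.nn as nn

people = ["Charlotte", "Victoria", "James"]        # toy vocabulary of people
relations = ["mother", "father"]                   # toy vocabulary of relations
triples = [("Charlotte", "mother", "Victoria"),    # Charlotte's mother is Victoria
           ("Charlotte", "father", "James")]       # Charlotte's father is James

p_idx = {p: i for i, p in enumerate(people)}
r_idx = {r: i for i, r in enumerate(relations)}

class FamilyTreesNet(nn.Module):
    def __init__(self, n_people, n_relations, dim=6):
        super().__init__()
        self.person_features = nn.Embedding(n_people, dim)       # distributed codes for people
        self.relation_features = nn.Embedding(n_relations, dim)  # distributed codes for relations
        self.hidden = nn.Linear(2 * dim, 12)                     # lets the two codes interact
        self.output = nn.Linear(12, n_people)                    # scores over possible answers

    def forward(self, person, relation):
        x = torch.cat([self.person_features(person),
                       self.relation_features(relation)], dim=-1)
        return self.output(torch.relu(self.hidden(x)))

net = FamilyTreesNet(len(people), len(relations))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

persons = torch.tensor([p_idx[a] for a, _, _ in triples])
rels = torch.tensor([r_idx[b] for _, b, _ in triples])
targets = torch.tensor([p_idx[c] for _, _, c in triples])

for step in range(500):                 # backpropagation adjusts all the feature vectors
    opt.zero_grad()
    loss = loss_fn(net(persons, rels), targets)
    loss.backward()
    opt.step()

# After training, net.person_features.weight holds the learned feature vectors
# (the "distributed representations") for each person.
```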
We submitted a paper to Nature in 1986 that had this example of backpropagation learning distributed features of words, and I talked to one of the referees of the paper, and that was what got him really excited about it, that this system was learning these distributed representations. He was a psychologist, and he understood that having a learning algorithm that could learn representations of things was a big breakthrough. My contribution was not discovering the backpropagation algorithm; that was something Rumelhart had pretty much figured out. It was showing that backpropagation would learn these distributed representations, and that was what was interesting to psychologists, and eventually, to AI people.
Quite a few years later, in the early 1990s, Yoshua Bengio rediscovered the same kind of network, but at a time when computers were faster. Yoshua was applying it to language, so he would take real text, use a few words as context, and then try to predict the next word. He showed that the neural network was pretty good at that and that it would discover these distributed representations of words. It made a big impact because the backpropagation algorithm could learn representations and you didn’t have to put them in by hand. People like Yann LeCun had been doing that in computer vision for a while. He was showing that backpropagation would learn good filters for processing visual input in order to make good decisions, and that was a bit more obvious because we knew the brain did things like that. The fact that backpropagation would learn distributed representations that captured the meanings and the syntax of words was a big breakthrough.
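The flavor of that kind of next-word model can be sketched in the same style; the window size, dimensions, and toy corpus below are assumptions for illustration rather than details of Bengio’s actual system.

```python
# Sketch of a fixed-window neural language model: embed the context words,
# concatenate their feature vectors, and predict the next word. The tiny corpus
# and all hyperparameters are made up for illustration.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
w_idx = {w: i for i, w in enumerate(vocab)}
context_size = 3                                    # predict word 4 from words 1-3

# Build (context, next word) training pairs from the toy corpus.
contexts, targets = [], []
for i in range(len(corpus) - context_size):
    contexts.append([w_idx[w] for w in corpus[i:i + context_size]])
    targets.append(w_idx[corpus[i + context_size]])
contexts = torch.tensor(contexts)
targets = torch.tensor(targets)

class WindowLM(nn.Module):
    def __init__(self, vocab_size, context_size, dim=16, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # one shared table of word features
        self.hidden = nn.Linear(context_size * dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ctx):
        x = self.embed(ctx).flatten(start_dim=1)     # concatenate the context features
        return self.out(torch.tanh(self.hidden(x)))

model = WindowLM(len(vocab), context_size)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for _ in range(300):
    opt.zero_grad()
    loss_fn(model(contexts), targets).backward()
    opt.step()

# model.embed.weight now holds learned distributed representations of the words.
```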
MARTIN FORD: Is it correct to say that at that time using neural networks was still not a primary thrust of AI research? It’s only quite recently this has come to the forefront.
GEOFFREY HINTON: There’s some truth to that, but you also need to make a distinction between AI and machine learning on the one hand, and psychology on the other hand. Once backpropagation became popular in 1986, a lot of psychologists got interested in it, and they didn’t really lose their interest in it, they kept believing that it was an interesting algorithm, maybe not what the brain did, but an interesting way of developing representations. Occasionally, you see the idea that there were only a few people working on it, but that’s not true. In psychology, lots of people stayed interested in it. What happened in AI was that in the late 1980s, Yann LeCun got something impressive working for recognizing handwritten digits, and there were various other moderately impressive applications of backpropagation from things like speech recognition to predicting credit card fraud. However, the proponents of backpropagation thought it was going to do amazing things, and they probably did oversell it. It didn’t really live up to the expectations we had for it. We thought it was going to be amazing, but actually, it was just pretty good.
In the early 1990s, other machine learning methods on small datasets turned out to work better than backpropagation and required fewer things to be fiddled with to get them to work well. In particular, something called the support vector machine did better at recognizing handwritten digits than backpropagation, and handwritten digits had been a classic example of backpropagation doing something really well. Because of that, the machine learning community really lost interest in backpropagation. They decided that there was too much fiddling involved, it didn’t work well enough to be worth all that fiddling, and it was hopeless to think that just from the inputs and outputs you could learn multiple layers of hidden representations. Each layer would be a whole bunch of feature detectors that represent the input in a particular way.
The idea of backpropagation was that you’d learn lots of layers, and then you’d be able to do amazing things, but we had great difficulty learning more than a few layers, and we couldn’t do amazing things. The general consensus among statisticians and people in AI was that we were wishful thinkers. We thought that just from the inputs and outputs, you should be able to learn all these weights; and that was just unrealistic. You were going to have to wire in lots of knowledge to make anything work.
That was the view of people in computer vision until 2012. Most people in computer vision thought this stuff was crazy. Even though Yann LeCun sometimes got systems working better than the best computer vision systems, they still thought it wasn’t the right way to do vision. They even rejected papers by Yann that worked better than the best computer vision systems on particular problems, because the referees thought it was the wrong way to do things. That’s a lovely example of scientists saying, “We’ve already decided what the answer has to look like, and anything that doesn’t look like the answer we believe in is of no interest.”
In the end, science won out, and two of my students won a big public competition, and they won it dramatically. They got almost half the error rate of the best computer vision systems, and they were using mainly techniques developed in Yann LeCun’s lab but mixed in with a few of our own techniques as well.
MARTIN FORD: This was the ImageNet competition?
GEOFFREY HINTON: Yes, and what happened then was what should happen in science. One method that people used to think of as complete nonsense had now worked much better than the method they believed in, and within two years, they all switched. So, for things like object classification, nobody would dream of trying to do it without using a neural network now.
MARTIN FORD: This was back in 2012, I believe. Was that the inflection point for deep learning?
GEOFFREY HINTON: For computer vision, that was the inflection point. For speech, the inflection point was a few years earlier. Two different graduate students at Toronto showed in 2009 that you could make a better speech recognizer using deep learning. They went as interns to IBM and Microsoft, and a third student took their system to Google. The basic system that they had built was developed further, and over the next few years, all these companies’ labs converted to doing speech recognition using neural nets. Initially, it was just using neural networks for the frontend of their system, but eventually, it was using neural nets for the whole system. Many of the best people in speech recognition had switched to believing in neural networks before 2012, but the big public impact was in 2012, when the vision community, almost overnight, got turned on its head and this crazy approach turned out to win.
MARTIN FORD: If you read the press now, you get the impression that neural networks and deep learning are equivalent to artificial intelligence—that it’s the whole field.
GEOFFREY HINTON: For most of my career, there was artificial intelligence, which meant the logic-based idea of making intelligent systems by putting in rules that allowed them to process symbol strings. People believed that’s what intelligence was, and that’s how they were going to make artificial intelligence. They thought intelligence consists of processing symbol strings according to rules, they just had to figure out what the symbol strings were and what the rules were, and that was AI. Then there was this other thing that wasn’t AI at all, and that was neural networks. It was an attempt to make intelligence by mimicking how the brain learns.
Notice that standard AI wasn’t particularly interested in learning. In the 1970s, they would always say that learning’s not the point. You have to figure out what the rules are and what the symbolic expressions they’re manipulating are, and we can worry about learning later. Why? Because the main point is reasoning. Until you’ve figured out how it does reasoning, there’s no point thinking about learning. The logic-based people were interested in symbolic reasoning, whereas the neural network-based people were interested in learning, perception, and motor control. They’re trying to solve different problems, and we believe that reasoning is something that evolutionarily comes very late in people, and it’s not the way to understand the basics of how the brain works. It’s built on top of something that’s designed for something else.
What’s happened now is that industry and government use “AI” to mean deep learning, and so you get some really paradoxical things. In Toronto, we’ve received a lot of money from industry and government for setting up the Vector Institute, which does basic research into deep learning, but also helps the industry do deep learning better and educates people in deep learning. Of course, other people would like some of this money, and another university claimed they had more people doing AI than Toronto and produced citation figures as evidence. That’s because they counted classical AI: they used citations of conventional AI work to argue that they should get some of this money for deep learning, so this confusion in the meaning of AI is quite serious. It would be much better if we just didn’t use the term “AI.”
MARTIN FORD: Do you really think that AI should just be focused on neural networks and that everything else is irrelevant?
GEOFFREY HINTON: I think we should say that the general idea of AI is making intelligent systems that aren’t biological, they are artificial, and they can do clever things. Then there’s what AI came to mean over a long period, which is what’s sometimes called good old-fashioned AI: representing things using symbolic expressions. For most academics—at least, the older academics—that’s what AI means: that commitment to manipulating symbolic expressions as a way to achieve intelligence.
I think that old-fashioned notion of AI is just wrong. I think they’re making a very naive mistake. They believe that if you have symbols coming in and you have symbols coming out, then it must be symbols in-between all the way. What’s in-between is nothing like strings of symbols, it’s big vectors of neural activity. I think the basic premise of conventional AI is just wrong.
MARTIN FORD: You gave an interview toward the end of 2017 where you said that you were suspicious of the backpropagation algorithm and that it needed to be thrown out and we needed to start from scratch. (https://www.axios.com/artificial-intelligence-pioneer-says-we-need-to-start-over-1513305524-f619efbd-9db0-4947-a9b2-7a4c310a28fe.html) That created a lot of disturbance, so I wanted to ask what you meant by that.
GEOFFREY HINTON: The problem was that the context of the conversation wasn’t properly reported. I was talking about trying to understand the brain, and I was raising the issue that backpropagation may not be the right way to understand the brain. We don’t know for sure, but there are some reasons now for believing that the brain might not use backpropagation. I said that if the brain doesn’t use backpropagation, then whatever the brain is using would be an interesting candidate for artificial systems. I didn’t at all mean that we should throw out backpropagation. Backpropagation is the mainstay of all the deep learning that works, and I don’t think we should get rid of it.
MARTIN FORD: Presumably, it could be refined going forward?
GEOFFREY HINTON: There’s going to be all sorts of ways of improving it, and there may well be other algorithms that are not backpropagation that also work, but I don’t think we should stop doing backpropagation. That would be crazy.
MARTIN FORD: How did you become interested in artificial intelligence? What was the path that took you to your focus on neural networks?
GEOFFREY HINTON: My story begins at high school, where I had a friend called Inman Harvey who was a very good mathematician who got interested in the idea that the brain might work like a hologram.
MARTIN FORD: A hologram being a three-dimensional representation?
GEOFFREY HINTON: Well, the important thing about a proper hologram is that if you take a hologram and you cut it in half, you do not get half the picture, but instead you get a fuzzy picture of the whole scene. In a hologram, information about the scene is distributed across the whole hologram, which is very different from what we’re used to. It’s very different from a photograph, where if you cut out a piece of a photograph you lose the information about what was in that piece of the photograph, it doesn’t just make the whole photograph go fuzzier.
Inman was interested in the idea that human memory might work like that, where an individual neuron is not responsible for storing an individual memory. He suggested that what’s happening in the brain is that you adjust the connection strengths between neurons across the whole brain to store each memory, and that it’s basically a distributed representation. At that time, holograms were an obvious example of distributed representation.
People misunderstand what’s meant by distributed representation, but what I think it means is you’re trying to represent some things—maybe concepts—and each concept is represented by activity in a whole bunch of neurons, and each neuron is involved in the representations of many different concepts. It’s very different from a one-to-one mapping between neurons and concepts. That was the first thing that got me interested in the brain. We were also interested in how brains might learn things by adjusting connection strengths, and so I’ve been interested in that basically the whole time.
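A purely illustrative contrast (with made-up numbers) between a one-to-one code and a distributed code might look like this:

```python
# Local vs. distributed codes for three toy concepts. The vectors are invented
# for illustration only.
import numpy as np

concepts = ["dog", "cat", "car"]

# Local (one-to-one) code: one neuron per concept.
local = np.eye(3)            # "dog" = [1,0,0], "cat" = [0,1,0], "car" = [0,0,1]

# Distributed code: each concept is a pattern of activity over all six neurons,
# and each neuron takes part in representing several concepts.
distributed = np.array([
    [0.9, 0.7, 0.1, 0.8, 0.0, 0.3],   # dog
    [0.8, 0.6, 0.2, 0.1, 0.9, 0.4],   # cat
    [0.1, 0.0, 0.9, 0.7, 0.2, 0.8],   # car
])

# Related concepts end up with similar (overlapping) patterns:
dog, cat, car = distributed
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(dog, cat), cos(dog, car))   # dog/cat overlap more than dog/car
```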
MARTIN FORD: When you were at high school? Wow. So how did your thinking develop when you went to university?
GEOFFREY HINTON: One of the things I studied at university was physiology. I was excited by physiology because I wanted to learn how the brain worked. Toward the end of the course they told us how neurons send action potentials. There were experiments done on the giant squid axon, figuring out how an action potential propagated along an axon, and it turned out that was how the brain worked. It was rather disappointing to discover, however, that they didn’t have any kind of computational model of how things were represented or learned.
After that, I switched to psychology, thinking they would tell me how the brain worked, but this was at Cambridge, and at that time it was still recovering from behaviorism, so psychology was largely about rats in boxes. There was some cognitive psychology then, but it was fairly non-computational, and I didn’t really get much sense that they were ever going to figure out how the brain worked.
During the psychology course, I did a project on child development. I was looking at children between the ages of two and five, and how the way that they attend to different perceptual properties changes as they develop. The idea is that when they’re very young, they’re mainly interested in color and texture, but as they get older, they become more interested in shape. I conducted an experiment where I would show the children three objects, of which one was the odd one out, for example, two yellow circles and a red circle. I trained the children to point at the odd one out, something that even very young children can learn to do.
I’d also train them on two yellow triangles and one yellow circle, and then they’d have to point at the circle because that was the odd one out on shape. Once they’d been trained on simple examples where there was a clear odd one out, I would then give them a test example like a yellow triangle, a yellow circle, and a red circle. The idea was that if they were more interested in color than shape, then the odd one out would be the red circle, but if they were more interested in shape than color, then the odd one out would be the yellow triangle. That was all well and good, and a couple of the children pointed at either the yellow triangle, which was a different shape, or the red circle, which was a different color. I remember, though, that when I first did the test with one bright five-year-old, he pointed at the red circle, and he said, “You’ve painted that one the wrong color.”
The model that I was trying to corroborate was a very dumb, vague model that said, “when they’re little, children attend more to color and as they get bigger, they attend more to shape.” It’s an incredibly primitive model that doesn’t say how anything works, it’s just a slight change in emphasis from color to shape. Then, I was confronted by this kid who looks at them and says, “You’ve painted that one the wrong color.” Here’s an information processing system that has learned what the task is from the training examples, and because he thinks there should be an odd one out, he realizes there isn’t a single odd one out, and that I must have made a mistake, and the mistake was probably that I painted that one the wrong color.