This problem links back to a philosophical idea elaborated in the 1960s by Paul Grice, J. L. Austin, and John Searle: that language is action. For example, if I say to the computer, “The printer is broken,” then what I don’t want is for it to say back to me, “Thanks, fact recorded.” What I actually want is for the system to do something that will get the printer fixed. For that to occur, the system needs to understand why I said something.
Current deep-learning-based natural-language systems perform poorly on these kinds of sentences in general. The reasons are really deeply rooted. What we’re seeing here is that these systems are really good at statistical learning, pattern recognition, and large-scale data analysis, but they don’t go below the surface. They can’t reason about the purposes behind what someone says. Put another way, they ignore the intentional structure of dialogue. More generally, deep-learning-based systems lack other hallmarks of intelligence: they cannot do counterfactual reasoning or common-sense reasoning.
You need all these capabilities to participate in a dialogue, unless you tightly constrain what a person says and does; but that makes it very hard for people to actually do what they want to do!
MARTIN FORD: What would you point to as being state-of-the-art right now? I was pretty astonished when I saw IBM Watson win at Jeopardy! I thought that was really remarkable. Was that as much of a breakthrough as it seemed to be, or would you point to something else as really being on the leading edge?
BARBARA GROSZ: I was impressed by Apple’s Siri and by IBM’s Watson; they were phenomenal achievements of engineering. I think that what is available today with natural language and speech systems is terrific. It’s changing the way that we interact with computer systems, and it’s enabling us to get a lot done. But these systems are nowhere near the human capacity for language, and you see that when you try to engage in a dialogue with them.
When Siri came out in 2011, it took me about three questions to break the system. Where Watson makes mistakes is most interesting, because it shows us where it is not processing language the way people do.
So yes, on the one hand, I think the progress in natural language and speech systems is phenomenal. We are far beyond what we could do in the ‘70s, partly because computers are way more powerful, and partly because there’s a lot more data out there. I’m thrilled that AI is actually out in the world making a difference, because I didn’t think that it would happen in my lifetime; the problems seemed so hard.
MARTIN FORD: Really, you didn’t think it would happen in your lifetime?
BARBARA GROSZ: Back in the 1970s? No, I didn’t.
MARTIN FORD: I was certainly very taken aback by Watson and especially by the fact that it could handle, for example, puns, jokes, and very complex presentations of language.
BARBARA GROSZ: But going back to the “Wizard of Oz” analogy: if you look behind the curtain at what’s actually in those systems, you realize they all have limitations. We’re at a moment where it’s really important to understand what these systems are good at and where they fail.
This is why I think it’s very important for the field, and frankly for the world, to understand that we could make a lot more progress on AI systems that would be good for people if we didn’t aim to replace people or to build generalized artificial intelligence, but instead focused on understanding what all these great capabilities are and are not good for, and on how to complement people with these systems, and these systems with people.
MARTIN FORD: Let’s focus on this idea of going off script and being able to really have a conversation. That relates directly to the Turing test, and I know you’ve done some additional work in that area. What do you think Turing’s intentions were in coming up with that test? Is it a good test of machine intelligence?
BARBARA GROSZ: I remind people that Turing proposed his test in 1950, a time when people had new computing machines that they thought were amazing. Now of course, those systems could do nothing compared to what a smartphone can do today, but at the time many people wondered if these machines could think like a human thinks. Remember, Turing used “intelligence” and “thinking” similarly; he wasn’t talking about intelligence in the sense of, say, Nobel Prize-winning science.
Turing was posing a very interesting philosophical question, and he made some conjectures about whether or not machines could exhibit a certain kind of behavior. The 1950s was also a time when psychology was rooted in behaviorism, and so his test is not only an operational test but also a test where there would be no looking below the surface.
The Turing test is not a good test of intelligence. Frankly, I would probably fail the Turing test because I’m not very good at social banter. It’s also not a good guide for what the field should aim to do. Turing was an amazingly smart person, but I’ve conjectured, somewhat seriously, that if he were alive today, and knew what we now know about how learning works, how the brain and language work, and how people develop intelligence and thinking, then he would have proposed a different test.
MARTIN FORD: I know that you’ve proposed some enhancements or even a replacement for the Turing test.
BARBARA GROSZ: Who knows what Turing would have proposed, but here is my proposal: given that we know the development of human intelligence depends on social interaction, that language capacity depends on social interaction, and that human activity in many settings is collaborative, I recommend that we aim to build a system that is a good team partner and works so well with us that we don’t recognize that it isn’t human. I mean, it’s not that we’re fooled into the idea that a laptop, robot, or phone is a human being, but that you don’t keep wondering “Why did it do that?” when it makes a mistake that no human would.
I think that this is a better goal for the field, in part because it has several advantages over the Turing test. One advantage is that you can meet it incrementally: if you pick a small enough arena in which to build a system, you can build a system that’s intelligent in that arena and works well on that kind of task. We could find systems out there now that we would say are intelligent in that way. And of course children, as they develop, are intelligent in different limited ways, and then they become smart in more, and more varied, ways.
With the Turing test, a system either succeeds or it fails, and there’s no guide for how to incrementally improve its reasoning. For science to develop, you need to be able to make steps along the way. The test I proposed also recognizes that for the foreseeable future people and computer systems will have complementary abilities, and it builds on that insight rather than ignoring it.
I first proposed this test in a talk in Edinburgh on the occasion of the 100th anniversary of Turing’s birth. I said that, given all the progress in computing and psychology, “We should think of new tests.” I asked the attendees at that talk for their ideas, and I have kept asking in subsequent talks. To date, the main response has been that this test is a good one.
MARTIN FORD: I’ve always thought that once we really have machine intelligence, we’ll just kind of know it when we see it. It’ll just be somehow obvious, and maybe there’s not a really explicit test that you can define. I’m not sure there’s a single test for human intelligence. I mean, how do you know another human being is intelligent?
BARBARA GROSZ: That’s a really good observation. Think about what I said when I gave the example of “Where’s the nearest emergency room?” and “Where can I go to get a heart attack treated?”: no human being you would consider intelligent would be able to answer one of those questions and not the other.
There’s a possibility that the person you asked might not be able to answer either question, say if you plonked them in some foreign city; but if they could answer one question, they could answer the other question. The point is, if you have a machine that answers both questions, then that seems intelligent to you. If you have a machine that answers only one question and not the other, then it doesn’t seem so intelligent.
What you just said actually fits with the test that I proposed. If the AI system is going along and acting, as it were, as intelligently as you would expect another human to act, then you’d think it is intelligent. What happens right now with many AI systems is that people think the AI system is smart, and then it does something that takes them aback, and then they think it’s completely stupid. At that point, the human wants to know why the AI system worked that way or didn’t work the way they expected, and by the end they no longer think it’s so smart.
By the way, the test that I proposed is not time-limited; in fact, it is actually supposed to be extended in time. Turing’s test was also not supposed to have a time limit, but that characteristic has been frequently forgotten, in particular in various recent AI competitions.
MARTIN FORD: That seems silly. People aren’t intelligent for only half an hour. It has to be for an indefinite time period to demonstrate true intelligence. I think there’s something called the Loebner Prize where Turing tests are run under certain limited conditions each year.
BARBARA GROSZ: Right, and it proves what you say. It also makes clear what we learned very early on in the natural-language processing arena, which is that if you have only a fixed task with a fixed set of issues (and in this case, a fixed amount of time), then cheap hacks will always win over real intelligent processing, because you’ll just design your AI system to the test!
MARTIN FORD: The other area that you have worked in is multi-agent systems, which sounds pretty esoteric. Could you talk a little about that and explain what that means?
BARBARA GROSZ: When Candy Sidner and I were developing the intentional model of discourse that I mentioned earlier, we first tried to build on the work of colleagues who were using AI models of planning developed for individual robots to formalize work in philosophy on speech act theory. When we tried to use those techniques in the context of dialogue, we found that they were inadequate. This discovery led us to the realization that teamwork or collaborative activity, or working together, cannot be characterized as simply the sum of individual plans.
After all, it’s not as if you have a plan to do a certain set of actions and I have a plan to do a certain set of actions, and they just happen to fit together. At the time, because AI planning researchers often used examples involving building stacks of toy blocks, I used the particular example of one child having a stack of blue blocks and another child having a stack of red blocks, and together they build a tower that has both red and blue blocks. But it’s not that the child with the blue blocks has a plan with those blocks in spaces that just happen to match where the plan of the child with red blocks has empty spaces.
Sidner and I realized, at this point, that we had to come up with a new way of thinking about—and representing in a computer system—plans of multiple participants, whether people or computer agents or both. So that’s how I got into multi-agent systems research.
The goal of work in this field is to think about computer agents being situated among other agents. In the 1980s, work in this area mostly concerned situations with multiple computer agents, either multiple robots or multiple software agents, and asked questions about competition and coordination.
MARTIN FORD: Just to clarify: when you talk about a computer agent, what you mean is a program, a process that goes and performs some action or retrieves some information or does something.
BARBARA GROSZ: That’s right. In general, a computer agent is a system able to act autonomously. Originally, most computer agents were robots, but for several decades AI research has involved software agents as well. Today there are computer agents that search and ones that compete in auctions, among many other tasks. So, an agent doesn’t have to be a robot that’s actually out there physically in the world.
For instance, Jeff Rosenschein did some really interesting work in the early years of multi-agent systems research, which considered situations like a group of delivery robots that need to get things all over a city and might do it more efficiently if they exchanged packages. He considered questions like whether they would tell the truth or lie about the tasks they actually had to do, because if an agent lied, it might come out ahead.
This whole area of multi-agent systems now addresses a wide range of situations and problems. Some work focuses on strategic reasoning, other work on teamwork. And, I’m thrilled to say, more recently much of it is looking at how computer agents can work with people, rather than just with other computer agents.
MARTIN FORD: Did this multi-agent work lead directly to your work in computational collaboration?
BARBARA GROSZ: Yes, one of the results of my work in multi-agent systems was to develop the first computational model of collaboration.
We asked, what does it mean to collaborate? People take an overall task and divide it up, delegating tasks to different people and leaving it to them to figure out the details. We make commitments to one another to do subtasks, and we (mostly) don’t wander off and forget what we committed to doing.
In business, a common message is that one person doesn’t try to do everything, but delegates tasks to other people depending on their expertise. The same is true in more informal collaborations.
I developed a model of collaboration that made these intuitions formal, in work with Sarit Kraus. That work then generated many new research questions, including how you decide who’s capable of doing what, what happens if something goes wrong, and what your obligation is to the team. So, you don’t just disappear or say, “Oh, I failed. Sorry. Hope you guys can do the task without me.”
In 2011-2012 I had a year’s sabbatical in California, and I decided that I wanted to see if this work on collaboration could make a difference in the world. So, pretty much since then, I have been working in the healthcare arena developing new methods for healthcare coordination, working with Stanford pediatrician Lee Sanders. The particular medical setting is children who have complex medical conditions and see 12 or 15 doctors. In this context, we’re asking: how can we provide systems that help those doctors share information and more successfully coordinate what they’re doing?
MARTIN FORD: Would you say that healthcare is one of the most promising areas for AI research? It certainly seems like the part of the economy that most needs to be transformed and made more productive. I’d say we’d be much better off as a society if we could give transforming medicine a higher priority than having robots that flip hamburgers and produce cheaper fast food.
BARBARA GROSZ: Right, and healthcare is an area, along with education, where it’s absolutely crucial that we focus on building systems that complement people, rather than systems that replace people.
MARTIN FORD: Let’s talk about the future of artificial intelligence. What do you think about all of the focus right now on deep learning? I feel that a normal person reading the press could come away with the impression that AI and deep learning are synonymous. Speaking of AI generally, what would you point to as the things that are absolutely on the forefront?
BARBARA GROSZ: Deep learning is not deep in any philosophical sense. The name comes from there being many layers in the neural network. It isn’t that deep learning is more intelligent, in the sense of being a deeper “thinker,” than other kinds of AI systems or learning. It works well because it has more mathematical flexibility.
Deep learning is tremendously good for certain tasks, essentially ones that fit its end-to-end processing: a signal comes in and you get an answer out; but it is also limited by the data it gets. We see this limitation in systems that can recognize white males much better than other kinds of people because there are more white males in the training data. We see it also in machine translation that works very well for literal language, where it’s had a lot of examples, but not for the kind of language you see in novels or anything that’s literary or alliterative.
MARTIN FORD: Do you think there will be a backlash against all the hype surrounding deep learning when its limitations are more widely recognized?
BARBARA GROSZ: I have survived numerous AI Winters in the past, and I’ve come away from them feeling both fearful and hopeful. I’m fearful that people, once they see the limitations of deep learning, will say, “Oh, it doesn’t really work.” But I’m hopeful that, because deep learning is so powerful for so many things and in so many areas, there won’t be an AI Winter around deep learning.
I do think, however, that to avoid an AI Winter for deep learning, people in the field need to put deep learning in its correct place, and be clear about its limitations.
I said at one point that “AI systems are best if they’re designed with people in mind.” Ece Kamar has noted that the data from which these deep learning systems learn comes from people. Deep learning systems are trained by people. And these deep learning systems do better if there are people in the loop correcting them when they’re getting something wrong. On the one hand, deep learning is very powerful, and it’s enabled the development of a lot of fantastic things. But deep learning is not the answer to every AI question. It has, for instance, so far shown no usefulness for common-sense reasoning!
MARTIN FORD: I think people are working on, for example, figuring out how to build systems that can learn from a lot less data. Right now, systems depend on enormous datasets to work at all.
BARBARA GROSZ: Right, but notice the issue is not just how much data they need, but the diversity of the data.
I’ve been thinking about this recently; simply put, why does it matter? If you or I were building a system to work in New York City or San Francisco, that would be one thing. But these systems are being used by people around the world, from different cultures, with different languages, and with different societal norms. Your data has to sample all of that space. And we don’t have equal amounts of data for different groups. If we go to less data, we have to say something like (and I’m being a bit facetious here), “This is a system that works really well for upper-income white men.”