by Brad Stone
Over the ensuing months, that amicable consensus devolved into a long-standing tug-of-war between Hart and his engineers, who saw playing music as a practical and marketable feature, and Bezos, who was thinking more grandly. Bezos started to talk about the “Star Trek computer,” an artificial intelligence that could handle any question and serve as a personal assistant. The fifty-cent word “plenipotentiary” was used inside the team to describe what he wanted: an assistant invested with full powers to take action on behalf of users, such as calling a cab or placing a grocery order. With his science fiction obsession, Bezos was forcing his team to think bigger and to push the boundaries of established technology. But Hart, facing pressure to actually ship the product, advocated for a feature set he dubbed “the magical and the mundane” and pushed to highlight basic, reliable features like allowing users to ask for weather reports as well as setting timers and alarms.
The debate manifested itself in endless drafts of the “PR FAQ”—the six-page narrative Amazonians craft in the form of a press release at the start of a new initiative to envision the product’s market impact. The paper, a hallowed part of Amazon’s rituals around innovation, forces them to begin any conversation about a new product in terms of the benefit it creates for customers. Dozens of versions of the Doppler PR FAQ were written, presented, debated, obsessed over, rewritten, and scrapped. Whenever the press release evolved to highlight playing music, “Jeff would get really mad. He didn’t like that at all,” recalled an early product manager.
Another early Doppler employee later speculated that Bezos’s famous lack of sophisticated musical tastes played a role in his reaction. When Bezos was testing an early Doppler unit, for example, he asked it to play one of his favorite songs: the theme to the classic TV show Battlestar Galactica. “Jeff was pushing really hard to make sure this product was more than just music,” said Ian Freed, Greg Hart’s boss. “He wouldn’t let go of it being a more generalized computer.”
A related discussion centered around the choice of a so-called “wake” word—the utterance that would rouse Doppler out of passive mode, when it was only listening for its own name, to switch into active listening, where it would send user queries over the internet to Amazon’s servers and return with a response. The speech science team wanted the wake word to have a distinct combination of phonemes and be at least three syllables, so the device wouldn’t be triggered by normal conversation. It also needed to be distinctive (like “Siri”) so that the name could be marketed to the public. Hart and his team presented Bezos with hundreds of flash cards, each with a different name, which he would spread out on conference room tables during the endless deliberations.
Bezos said he wanted the wake word to sound “mellifluous” and opined that his mother’s name, Jacklyn, was “too harsh.” His own quickly discarded suggestions included “Finch,” the title of a fantasy detective novel by Jeff VanderMeer; “Friday,” after the personal assistant in the novel Robinson Crusoe; and “Samantha,” the witch who could twitch her nose and accomplish any task on the TV show Bewitched. For a while, he also believed the wake word should be “Amazon,” so that whatever aura of good feeling the device generated would be reflected back onto the company.
Doppler execs argued that people would not want to talk to a corporate entity in their homes, and that spawned another ongoing disagreement. Bezos also suggested “Alexa,” an homage to the ancient library of Alexandria, regarded as the capital of knowledge. This was also the name of an unrelated startup Amazon had acquired in the 1990s, which sold web traffic data and continued to operate independently. After endless debates and lab testing, “Alexa” and “Amazon” became the top candidates for the wake word as the device moved into limited trials in the homes of Amazon employees at the start of 2013.
The devices employees received looked very much like the original Echo that would be introduced by Amazon less than two years later. The industrial designers at Lab126 called it the “Pringles can”—a cylinder elongated to create separation between the array of seven omnidirectional microphones at the top and the speakers at the bottom, with some fourteen hundred holes punctured in the metal tubing to push out air and sound. The device had an LED light ring at the top, another Bezos idea, which would light up in the direction of the person speaking, reproducing the social cue of looking at someone when they are talking to you. It was not an elegant-looking device, Bezos having instructed the designers to let function dictate the form.
The experimental Doppler devices in the homes of hundreds of Amazon employees were not smart—they were, by all accounts, slow and dumb. An Amazon manager named Neil Ackerman signed up for the internal beta, bringing one home to his family in early 2013. Both he and his wife had to sign several confidentiality agreements, promising they would turn it off and hide it if guests came over. Every week they had to fill out a spreadsheet, answering questions and listing what they asked it and how it responded. Ackerman’s wife called it “the thing.”
“We were both pretty skeptical about it,” he said. “It would hardly ever give me the right answer and the music coming out of it was inconsistent and certainly not the family favorites.” Inexplicably it seemed to best understand their son, who had a speech impediment.
Other early beta testers didn’t mince words either. Parag Garg, one of the first engineers to work on the Fire TV, took home a device and said it “didn’t work for shit and I didn’t miss it when it was gone. I thought, ‘Well, this thing is doomed.’ ” A manager on the Fire Phone recalls liking the look of the hardware, “but I could not foresee what it was going to be used for. I thought it was a stupid product.”
Two Doppler engineers recall another harrowing review—from Bezos himself. The CEO was apparently testing a unit in his Seattle home, and in a fit of frustration over its lack of comprehension, he told Alexa to go “shoot yourself in the head.” One of the engineers who heard the comment while reviewing interactions with the test device said: “We all thought it might be the end of the project, or at least the end of a few of us at Amazon.”
* * *
Alexa, it was clear, needed a brain transplant. Amazon’s ongoing efforts to make its product smarter would create a dogmatic battle inside the Doppler team and lead to its biggest challenge yet.
The first move was to integrate the technology of a third acquisition, a Cambridge, England–based artificial intelligence company called Evi (pronounced Evee). The startup was founded in 2005 as a question-and-answer tool called True Knowledge by British entrepreneur William Tunstall-Pedoe. As a university student, Tunstall-Pedoe had created websites like Anagram Genius, which automatically rearranged the letters in words to produce another word or phrase. The site was later used by novelist Dan Brown to create puzzles in The Da Vinci Code.
In 2012, inspired by Siri’s debut, Tunstall-Pedoe pivoted and introduced the Evi app for the Apple and Android app stores. Users could ask it questions by typing or speaking. Instead of searching the web for an answer like Siri, or returning a set of links, like Google’s voice search, Evi evaluated the question and tried to offer an immediate answer. The app was downloaded over 250,000 times in its first week and almost crashed the company’s servers. Apple threatened to kick it off the iOS app store for appearing “confusingly similar” to Siri, then relented when fans objected. Thanks to all this attention, Evi had at least two acquisition offers and a prospective investment from venture capitalists when Amazon won out in late 2012 with a rumored $26 million deal.
Evi employed a programming technique called knowledge graphs, or large databases of ontologies, which connect concepts and categories in related domains. If, for example, a user asked Evi, “What is the population of Cleveland?” the software interpreted the question and knew to turn to an accompanying source of demographic data. Wired described the technique as a “giant treelike structure” of logical connections to useful facts.
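The lookup described above can be sketched in a few lines. This is a toy illustration only, not Evi's actual design: the `FACTS` table, the question template, and the `answer` function are hypothetical names, and the population value is a placeholder rather than real demographic data.

```python
import re

# Toy knowledge graph stored as (subject, relation) -> object.
# Every entry is illustrative; the population figure is a placeholder.
FACTS = {
    ("cleveland", "is_a"): "city",
    ("city", "has_attribute"): "population",
    ("cleveland", "population"): 123456,  # placeholder, not a real figure
}

# Hypothetical question template that maps phrasing to a relation and a subject.
PATTERN = re.compile(
    r"what is the (?P<relation>\w+) of (?P<subject>[\w ]+)\?",
    re.IGNORECASE,
)

def answer(question):
    """Return the stored fact for a recognized question, else None."""
    match = PATTERN.fullmatch(question.strip())
    if match is None:
        return None  # phrasing falls outside the template
    subject = match.group("subject").strip().lower()
    relation = match.group("relation").lower()
    # Consult the ontology: the subject's category must carry this attribute.
    category = FACTS.get((subject, "is_a"))
    if FACTS.get((category, "has_attribute")) != relation:
        return None
    return FACTS.get((subject, relation))

print(answer("What is the population of Cleveland?"))
print(answer("Play music by Sting"))
```

The second call returns `None` because the request does not fit the template, which hints at how rigid such a hand-built structure can be when users phrase things unexpectedly.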
Putting Evi’s knowledge base inside Alexa helped with the kind of informal but culturally common chitchat called phatic speech. If a user said to the device, “Alexa, good morning, how are you?” Alexa could make the right connection and respond. Tunstall-Pedoe said he had to fight with colleagues in the U.S. over the unusual idea of having Alexa respond to such social cues, recalling that “People were uncomfortable with the idea of programming a machine to respond to ‘hello.’ ”
Integrating Evi’s technology helped Alexa respond to factual queries, such as requests to name the planets in the solar system, and it gave the impression that Alexa was smart. But was it? Proponents of another method of natural language understanding, called deep learning, believed that Evi’s knowledge graphs wouldn’t give Alexa the kind of authentic intelligence that would satisfy Bezos’s dream of a versatile assistant that could talk to users and answer any question.
In the deep learning method, machines were fed large amounts of data about how people converse and what responses proved satisfying, and then were programmed to train themselves to predict the best answers. The chief proponent of this approach was an Indian-born engineer named Rohit Prasad. “He was a critical hire,” said engineering director John Thimsen. “Much of the success of the project is due to the team he assembled and the research they did on far-field speech recognition.”
Prasad was raised in Ranchi, the capital of the eastern Indian state of Jharkhand. He grew up in a family of engineers and got hooked on Star Trek at a young age. Personal computers weren’t common in India at the time, but at an early age he learned to code on a PC at the metallurgical and engineering consulting company where his father worked. Since communication in India was hampered by poor telecommunications infrastructure and high long-distance rates, Prasad decided to study how to compress speech over wireless networks when he moved to the U.S. to attend graduate school.
After graduating in the late 1990s, Prasad passed on the dot-com boom and worked for the Cambridge, Massachusetts–based defense contractor BBN Technologies (later acquired by Raytheon) on some of the first speech recognition and natural language systems. At BBN, he worked on one of the first in-car speech recognition systems and automated directory assistance services for telephone companies. In 2000, he worked on another system that automatically transcribed courtroom proceedings. Accurately recording conversation from multiple microphones placed around a courtroom introduced him to the challenges of far-field speech recognition. At the start of the project, he said that eighty out of every hundred words were incorrect; but within the first year, they cut it down to thirty-three.
Years later, as the Doppler team was trying to improve Alexa’s comprehension, Bill Barton, who led Amazon’s Boston office, introduced Prasad to Greg Hart. Prasad didn’t know much about Amazon and showed up for the interview in Seattle wearing a suit and tie (a minor faux pas) and with no clue about Amazon’s fourteen leadership principles (a bigger one). He expressed reservations about joining a large, plodding tech company, but by the time he returned to his hotel room, Hart had emailed him a follow-up note that promised, “We are essentially a startup. Even though we are part of a big company, we don’t act like one.”
Persuaded, Prasad joined to work on the problems of far-field speech recognition, but he ended up as an advocate for the deep learning model. Evi’s knowledge graphs were too regimented to be Alexa’s foundational response model; if a user says, “Play music by Sting,” such a system may think he is trying to say “bye” to the artist and get confused, Prasad later explained. By using the statistical training methods of deep learning, the system could quickly ascertain that when the sentence is uttered, the intent is almost certainly to blast “Every Breath You Take.”
But Evi’s Tunstall-Pedoe argued that knowledge graphs were the more practical solution and mistrusted the deep learning approach. He felt it was error-prone and would require an endless diet of training data to properly mold Alexa’s learning models. “The thing about machine learning scientists is that they never admit defeat because all of their problems can be solved with more data,” he explained. That response might carry a tinge of regret with it, because to the über product manager, Bezos himself, there was no question about which way time’s arrow was pointed—toward machine learning and deep neural networks. With its vast and sophisticated AWS data centers, Amazon was also in the unique position of being able to harness a large number of high-powered computer processors to train its speech models, exploiting its advantage in the cloud in a way few of its competitors could. Defeated, Tunstall-Pedoe ended up leaving Amazon in 2016.
Even though the deep learning approach won out, Prasad and his allies still had to solve the paradox that confronts all companies developing AI: they don’t want to launch a system that is dumb—customers won’t use it, and so won’t generate enough data to improve the service. But companies need that data to train the system to make it smarter.
Google and Apple solved the paradox in part by licensing technology from Nuance, using its results to train their own speech models and then afterward cutting ties with the company. For years, Google also collected speech data from a toll-free directory assistance line, 800-Goog-411. Amazon had no such services it could mine, and Greg Hart was against licensing outside technology—he thought it would limit the company’s flexibility in the long run. But the meager training data from the beta tests with employees amounted to speech from a few hundred white-collar workers, usually uttered from across the room in their noisy homes in the mornings and evenings when they weren’t at the office. The data was lousy, and there wasn’t enough of it.
Meanwhile Bezos grew impatient. “How will we even know when this product is good?” he kept asking in early 2013. Hart, Prasad, and their team created graphs that projected how Alexa would improve as data collection progressed. The math suggested they would need to roughly double the scale of their data collection efforts to achieve each successive 3 percent increase in Alexa’s accuracy.
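The scaling relationship the team charted can be written out as a rough formula. This is a sketch under stated assumptions, not the team's actual model: it reads "3 percent" as three percentage points of accuracy, treats the doubling rule as exact, and uses the symbols $A$ (accuracy, in percent), $D$ (volume of training data), and $D_0$ (the starting volume), none of which appear in the original account.

```latex
% Each doubling of the data volume D adds about 3 points of accuracy A:
A(D) \approx A(D_0) + 3 \log_2\!\left(\frac{D}{D_0}\right)

% Solving for the data needed to gain \Delta A points:
\frac{D}{D_0} \approx 2^{\Delta A / 3}

% Example: a gain of 9 points needs 2^{9/3} = 2^3 = 8 times the data.
```

Under these assumptions the data requirement grows exponentially with the accuracy target, which is why each further improvement looked so expensive.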
That spring, only a few weeks after Rohit Prasad had joined the company, they brought a six-page narrative to Bezos that laid out these facts, proposed to double the size of the speech science team and postpone a planned launch from the summer into the fall. Held in Bezos’s conference room, the meeting did not go well.
“You are going about this the wrong way,” Bezos said after reading about the delay. “First tell me what would be a magical product, then tell me how to get there.”
Bezos’s technical advisor at the time, Dilip Kumar, then asked if the company had enough data. Prasad, who was calling into the meeting from Cambridge, replied that they would need thousands more hours of complex, far-field voice commands. According to an executive who was in the room, Bezos apparently factored in the request to increase the number of speech scientists and did the calculation in his head in a few seconds. “Let me get this straight. You are telling me that for your big request to make this product successful, instead of it taking forty years, it will only take us twenty?”
Prasad tried to dance around it. “Jeff, that is not how we think about it.”
“Show me where my math is wrong!” Bezos said.
Hart jumped in. “Hang on, Jeff, we hear you, we got it.”
Prasad and other Amazon executives would remember that meeting, and the other tough interactions with Bezos during the development of Alexa, differently. But according to the executive who was there, the CEO stood up, said, “You guys aren’t serious about making this product,” and abruptly ended the meeting.
* * *
In the very same buildings in Seattle and Sunnyvale, California, where the Doppler team was trying to make Alexa smarter, Amazon’s campaign to build its own smartphone was careening toward disaster.
A few years before, Apple, Google, and Samsung had staked out large positions in the dawning smartphone market but had left the impression that terrain might remain for innovative newcomers. Typically, Jeff Bezos was not about to cede a critical strategic position in the unfolding digital terrain to other companies, especially when he believed the ground was still fertile for innovative approaches. In one brainstorming session he proposed a robot that could retrieve a carelessly discarded phone and drag it over to a wireless charger. (Some employees thought he was joking, but a patent on the idea was filed.) In another, he proposed a phone with an advanced 3D display, responsive to gestures in the air, instead of only taps on a touchscreen. It would be like nothing else in stores. Bezos clung to that idea, which would become the seed of the Fire Phone project.
The original designers settled on a handset with four infrared cameras, one in each corner of the phone’s face, to track the user’s gaze and present the illusion of a 3D image, along with a fifth camera on the back (because it could “see” from both sides of its head, the project was code-named Tyto, after a genus of owl). The custom-made Japanese cameras would cost $5 a handset, but Bezos envisioned a premium Amazon smartphone with top-of-the-line components.
Bezos met the Tyto team every few days for three years, at the same time he was meeting the Alexa team as frequently. He was infatuated with new technologies and business lines and loved spitballing ideas and reviewing the team’s progress. And while he was inordinately focused on customer feedback in other parts of Amazon’s business, Bezos did not believe that listening to customers could result in dramatic product inventions, evangelizing instead for creative “wandering,” which he believed was the path to dramatic breakthroughs. “The biggest needle movers will be things that customers don’t know to ask for,” he would write years later in a letter to shareholders. “We must invent on their behalf. We have to tap into our own inner imagination about what’s possible.”
But many Tyto employees were skeptical of his vision for smartphones. No one was sure the 3D display was anything more than a gimmick and a major drain on the phone’s battery. Bezos also had some worrisome blind spots about smartphones. “Does anyone actually use the calendar on their phone?” he asked in one meeting. “We do use the calendar, yes,” someone who did not have several personal assistants replied.