by Brad Stone
While the offices clustered around the intersection of Terry Avenue North and Harrison Street were largely anonymous, inside they bore all the distinguishing marks of a unique and idiosyncratic corporate culture. Employees wore color-coded badges around their necks signifying their seniority at the company (blue for those with up to five years of tenure, yellow for up to ten, red for up to fifteen), and the offices and elevators were decorated with posters delineating Bezos’s fourteen sacrosanct leadership principles.
Within these walls ranged Bezos himself, forty-six years old at the time, carrying himself in such a way as to always exemplify Amazon’s unique operating ideology. The CEO, for example, went to great lengths to illustrate Amazon’s principle #10, “frugality”: Accomplish more with less. Constraints breed resourcefulness, self-sufficiency, and invention. There are no extra points for growing headcount, budget size, or fixed expense. His wife, MacKenzie, drove him to work most days in their Honda minivan, and when he flew with colleagues on his private Dassault Falcon 900EX jet, he often mentioned that he personally, not Amazon, had paid for the flight.
If Bezos took one leadership principle most to heart, one that would also come to define the next half decade at Amazon, it was principle #8, “think big”: Thinking small is a self-fulfilling prophecy. Leaders create and communicate a bold direction that inspires results. They think differently and look around corners for ways to serve customers. In 2010, Amazon was a successful online retailer, a nascent cloud provider, and a pioneer in digital reading. But Bezos envisioned it as much more. His shareholder letter that year was a paean to the esoteric computer science disciplines of artificial intelligence and machine learning that Amazon was just beginning to explore. It opened by citing a list of impossibly obscure terms such as “naïve Bayesian estimators,” “gossip protocols,” and “data sharding.” Bezos wrote: “Invention is in our DNA and technology is the fundamental tool we wield to evolve and improve every aspect of the experience we provide our customers.”
Bezos wasn’t only imagining these technological possibilities. He was also attempting to position Amazon’s next generation of products directly on its farthest frontier. Around this time, he started working intensively with the engineers at Lab126, Amazon’s Silicon Valley R&D subsidiary, which had developed the company’s first gadget, the Kindle. In a flurry of brainstorming sessions, he initiated several projects to complement the Kindle and the coming Kindle Fire tablets, which were known internally at the time as Project A.
Project B, which became Amazon’s ill-fated Fire Phone, would use an assembly of front-facing cameras and infrared lights to conjure a seemingly three-dimensional smartphone display. Project C, or “Shimmer,” was a desk lamp–shaped device designed to project hologram-like displays onto a table or ceiling. It proved unfeasibly expensive and was never launched.
Bezos had peculiar ideas about how customers might interact with these devices. The engineers working on the third version of the Kindle discovered this when they tried to kill a microphone that was planned for the device, since no features were slated to actually use it. But the CEO insisted that the microphone remain. “The answer I got is that Jeff thinks in the future we’ll talk to our devices,” said Sam Bowen, then a Kindle hardware director. “It felt a bit more like Star Trek than reality.”
Designers convinced Bezos to lose the microphone in subsequent versions of the Kindle, but he clung to his belief in the inevitability of conversational computing and the potential of artificial intelligence to make it practical. It was a trope in all his favorite science fiction, from TV’s Star Trek (“computer, open a channel”) to authors like Arthur C. Clarke, Isaac Asimov, and Robert A. Heinlein whose books lined the library of hundreds of volumes in his lakefront Seattle-area home. While others read these classics and only dreamed of alternate realities, Bezos seemed to consider the books blueprints for an exciting future. It was a practice that would culminate in Amazon’s defining product for a new decade: a cylindrical speaker that sparked a wave of imitators, challenged norms around privacy, and changed the way people thought about Amazon—not only as an e-commerce giant, but as an inventive technology company that was pushing the very boundaries of computer science.
The initiative was originally designated inside Lab126 as Project D. It would come to be known as the Amazon Echo, and by the name of its virtual assistant, Alexa.
* * *
As with several other projects at Amazon, the origins of Project D can be traced back to discussions between Bezos and his “technical advisor” or TA, the promising executive handpicked to shadow the CEO. Among the TA’s duties were to take notes in meetings, write the first draft of the annual shareholder letter, and learn by interacting with the master closely for more than a year. In the role from 2009 to 2011 was Amazon executive Greg Hart, a veteran of the company’s earliest retail categories, like books, music, DVDs, and video games. Originally from Seattle, Hart had attended Williams College in Western Massachusetts and, after a stint in the ad world, returned home at the twilight of the city’s grunge era, sporting a goatee and a penchant for flannel shirts. By the time he was following Bezos around, the facial hair was gone and Hart was a rising corporate star. “You sort of feel like you’re an assistant coach watching John Wooden, you know, perhaps the greatest basketball coach ever,” Hart said of his time as the TA.
Hart remembered talking to Bezos about speech recognition one day in late 2010 at Seattle’s Blue Moon Burgers. Over lunch, Hart demonstrated his enthusiasm for Google’s voice search on his Android phone by saying, “pizza near me,” and then showing Bezos the list of links to nearby pizza joints that popped up on-screen. “Jeff was a little skeptical about the use of it on phones, because he thought it might be socially awkward,” Hart remembered. But they discussed how the technology was finally getting good at dictation and search.
At the time, Bezos was also excited about Amazon’s growing cloud business, asking all of his executives, “What are you doing to help AWS?” Inspired by the conversations with Hart and others about voice computing, he emailed Hart, device vice president Ian Freed, and senior vice president Steve Kessel on January 4, 2011, linking the two topics: “We should build a $20 device with its brains in the cloud that’s completely controlled by your voice.” It was another idea from the boss who seemed to have a limitless wellspring of them.
Bezos and his employees riffed on the idea over email for a few days, but no further action was taken, and it could have ended there. Then a few weeks later, Hart met with Bezos in a sixth-floor conference room in Amazon’s headquarters, Day 1 North, to discuss his career options. His tenure as TA was wrapping up, so they discussed several possible opportunities to lead new initiatives at the company, including positions in Amazon’s video streaming and advertising groups. Bezos jotted their ideas down on a whiteboard, adding a few of his own, and then started to apply his usual criteria to assess their merit: If they work, will they grow to become big businesses? If the company didn’t pursue them aggressively now, would it miss an opportunity? Eventually Bezos and Hart crossed off all the items on the list except one—pursuing Bezos’s idea for a voice-activated cloud computer.
“Jeff, I don’t have any experience in hardware, and the largest software team I’ve led is only about forty people,” Hart recalled saying.
“You’ll do fine,” Bezos replied.
Hart thanked him for the vote of confidence and said, “Okay, well, remember that when we screw up along the way.”
Before they parted, Bezos illustrated his idea for the screenless voice computer on the whiteboard. The first-ever depiction of an Alexa device showed the speaker, microphone, and a mute button. And it identified the act of configuring the device to a wireless network, since it wouldn’t be able to listen to commands right out of the box, as a challenge requiring further thought. Hart snapped a photo of the drawing with his phone.
Bezos would remain intimately involved in the project, meeting with the team as frequently as every other day, making detailed product decisions, and authorizing the investment of hundreds of millions of dollars in the project before the first Echo was ever released. Using the German superlative, employees referred to him as the über product manager.
But it was Greg Hart who ran the team, just across the street from Bezos’s office, in Fiona, the Kindle building. Over the next few months, Hart hired a small group from in and outside the company, sending out emails to prospective hires with the subject line “Join my mission” and asking interview questions like “How would you design a Kindle for the blind?” Then, just as obsessed with secrecy as his boss, he declined to specify what product candidates would be working on. One interviewee recalled guessing that it was Amazon’s widely rumored smartphone and said that Hart replied, “There’s another team building a phone. But this is way more interesting.”
One early recruit was Amazon engineer Al Lindsay, who in a previous job had written some of the original code for telco US West’s voice-activated directory assistance. Lindsay spent his first three weeks on the project on vacation at his cottage in Canada, writing a six-page narrative that envisioned how outside developers might program their own voice-enabled apps that could run on the device. Another internal recruit, Amazon manager John Thimsen, signed on as director of engineering and coined a formal code name for the initiative, Doppler, after the Project D designation. “At the start, I don’t think anybody really expected it to succeed, to be honest with you,” Thimsen told me. “But to Greg’s credit, halfway through, we were all believers.”
The initial Alexa crew worked with a feverish sense of urgency due to their impatient boss. Unrealistically, Bezos wanted to release the device in six to twelve months. He would have a good reason to hurry. On October 4, 2011, just as the Doppler team was coming together, Apple introduced the Siri virtual assistant in the iPhone 4S, the last passion project of cofounder Steve Jobs, who died of cancer the next day. That the resurgent Apple had the same idea of a voice-activated personal assistant was both validating for Hart and his employees and discouraging, since Siri was first to market and with initial mixed reviews. The Amazon team tried to reassure themselves that their product was unique, since it would be independent from smartphones. Perhaps a more significant differentiator though was that Siri unfortunately could no longer have Jobs’s active support, while Alexa would have Bezos’s sponsorship and almost maniacal attention inside Amazon.
To speed up development and meet Bezos’s goals, Hart and his crew started looking for startups to acquire. It was a nontrivial challenge, since Nuance, the Boston-based speech giant whose technology Apple had licensed for Siri, had grown over the years by gobbling up the top American speech companies. Doppler execs tried to learn which of the remaining startups were promising by asking prospective targets to voice-enable the Kindle digital book catalog, then studying their methods and results. The search led to several rapid-fire acquisitions over the next two years, which would end up shaping Alexa’s brain and even the timbre of its voice.
The first company Amazon bought, Yap, a twenty-person startup based in Charlotte, North Carolina, automatically translated human speech such as voicemails into text, without relying on a secret workforce of human transcribers in low-wage countries. Though much of Yap’s technology would be discarded, its engineers would help develop the technology to convert what customers said to Doppler into a computer-readable format. During the prolonged courtship, Amazon execs tormented Yap execs by refusing to disclose what they’d be working on. Even a week after the deal closed, Al Lindsay was with Yap’s engineers at an industry conference in Florence, Italy, where he insisted that they pretend they didn’t know him, so that no one could catch on to Amazon’s newfound interest in speech technology.
After the purchase was finalized for around $25 million, Amazon dismissed the company’s founders but kept its speech science group in Cambridge, Massachusetts, making it the seed of a new R&D office in Kendall Square, near MIT. Yap engineers flew to Seattle, walking into a conference room on the first floor of Fiona with locked doors and closed window blinds. There Greg Hart finally described “this little device, about the size of a Coke can, that would sit on your table and you could ask it natural language questions and it would be a smart assistant,” recalled Yap’s VP of research, Jeff Adams, a two-decade veteran of the speech industry. “Half of my team were rolling their eyes, saying ‘oh my word, what have we gotten ourselves into.’ ”
After the meeting, Adams delicately told Hart and Lindsay that their goals were unrealistic. Most experts believed that true “far-field speech recognition”—comprehending speech from up to thirty-two feet away, often amid crosstalk and background noise—was beyond the realm of established computer science, since sound bounces off surfaces like walls and ceilings, producing echoes that confuse computers. The Amazon executives responded by channeling Bezos’s resolve. “They basically told me, ‘We don’t care. Hire more people. Take as long as it takes. Solve the problem,’ ” recalled Adams. “They were unflappable.”
* * *
A few months after the Yap purchase, Greg Hart and his colleagues acquired another piece of the Doppler puzzle. It was the technological antonym of Yap, which converted speech into text. Instead, the Polish startup Ivona generated computer-synthesized speech that resembled a human voice.
Ivona was founded in 2001 by Lukasz Osowski, a computer science student at the Gdańsk University of Technology. Osowski had the notion that so-called “text to speech,” or TTS, could read digital texts aloud in a natural voice and help the visually impaired in Poland appreciate the written word. With a younger classmate, Michal Kaszczuk, he took recordings of an actor’s voice and selected fragments of words, called diphones, and then blended or “concatenated” them together in different combinations to approximate natural-sounding words and sentences that the actor might never have uttered.
The Ivona founders got an early glimpse of how powerful their technology could be. While students, they paid a popular Polish actor named Jacek Labijak to record hours of speech to create a database of sounds. The result was their first product, Spiker, which quickly became the top-selling computer voice in Poland. Over the next few years, it was used widely in subways, elevators, and for robocall campaigns. Labijak subsequently began to hear himself everywhere and regularly received phone calls in his own voice urging him, for example, to vote for a candidate in an upcoming election. Pranksters manipulated the software to have him say inappropriate things and posted the clips online, where his children discovered them. The Ivona founders then had to renegotiate the actor’s contract after he angrily tried to withdraw his voice from the software. (Today “Jacek” remains one of the Polish voices offered by AWS’s Amazon Polly computer voice service.)
In 2006, Ivona began to enter and repeatedly win the annual Blizzard Challenge, a competition for the most natural computer voice, organized by Carnegie Mellon University. By 2012, Ivona had expanded into twenty other languages and had over forty voices. After learning of the startup, Greg Hart and Al Lindsay diverted to Gdańsk on their trip through Europe looking for acquisition targets. “From the minute we walked into their offices, we knew it was a culture fit,” Lindsay said, pointing to Ivona’s progress in a field where researchers often got distracted by high-minded pursuits. “Their scrappiness allowed them to look outside pure academia and not be blinded by science.”
The purchase, for around $30 million, was consummated in 2012 but kept secret for a year. The Ivona team and the growing number of speech engineers Amazon would hire for its new Gdańsk R&D center were put in charge of crafting Doppler’s voice. The program was micromanaged by Bezos himself and subject to the CEO’s usual curiosities and whims.
At first, Bezos said he wanted dozens of distinct voices to emanate from the device, each associated with a different goal or task, such as listening to music or booking a flight. When that proved impractical, the team considered lists of characteristics they wanted in a single personality, such as trustworthiness, empathy, and warmth, and determined those traits were more commonly associated with a female voice.
To develop this voice and ensure it had no trace of a regional accent, the team in Poland worked with an Atlanta area–based voice-over studio, GM Voices, the same outfit that had helped turn recordings from a voice actress named Susan Bennett into Apple’s agent, Siri. To create synthetic personalities, GM Voices gave female voice actors hundreds of hours of text to read, from entire books to random articles, a mind-numbing process that could stretch on for months.
Greg Hart and colleagues spent months reviewing the recordings produced by GM Voices and presented the top candidates to Bezos. They ranked the best ones, asked for additional samples, and finally made a choice. Bezos signed off on it.
Characteristically secretive, Amazon has never revealed the name of the voice artist behind Alexa. I learned her identity after canvassing the professional voice-over community: Boulder-based singer and voice actress Nina Rolle. Her professional website contained links to old radio ads for products such as Mott’s Apple Juice and the Volkswagen Passat, and the warm timbre of Alexa’s voice was unmistakable. Rolle said she wasn’t allowed to talk to me when I reached her on the phone in February of 2021. And when I asked Amazon to speak with her, they declined.
* * *
While the Doppler team hired engineers and acquired startups, nearly every other aspect of the product was hotly debated in Amazon’s offices in Seattle and in Lab126 in Silicon Valley. In one of the earliest Doppler meetings, Greg Hart identified the ability to play music with a voice command as the device’s marquee feature. Bezos “agreed with that framework but he stressed that music may be like 51 percent, but the other 49 percent are going to be really important,” Hart said.