by Gene Kim
“I’m surprised no one is talking about all the problems they’re having with environments or automated testing or the lack of production telemetry,” Kurt says. “We’ve built some amazing capabilities that other people can use too. But we can’t be the people with a solution, peddling them to people who don’t know they have a problem.”
Kurt looks stumped. And frustrated.
“I totally want to help with this,” Shannon says, raising her hand. “I’ve worked with a bunch of the Phoenix teams. I could swing by each one tomorrow to start asking them what their constraints are and any ideas they have on how to fix them.”
“Good, good,” Kurt says, writing down some notes in his notebook.
“I’d love to help too, Shannon,” Maxine says. “But Tom and I will be a little tied up on Monday, because Monday is Testing Day. I’m going to finally get my changes tested with the QA folks. Outside of that, I’m yours!” A full tray of beer pitchers and two more glasses of wine appear.
They are soon in deep conversation about technical debt and ideas on how to take advantage of Project Inversion. Maxine turns to see Erik grabbing the seat next to her.
He joins the conversation as if he’s been there all along. “With Project Inversion, you are all on the beginning of a great journey. Every tech giant has nearly been killed by technical debt. You name it: Facebook, Amazon, Netflix, Google, Microsoft, eBay, LinkedIn, Twitter, and so many more. Like the Phoenix Project, they became so encumbered by technical debt they could no longer deliver what their customers demanded,” Erik says. “The consequences would have been fatal—and for every survivor, there are companies like Nokia who fell from the loftiest heights, killed by technical debt.
“Technical debt is a fact of life, like deadlines. Business people understand deadlines, but often are completely oblivious that technical debt even exists. Technical debt is inherently neither good nor bad—it happens because in our daily work, we are always making trade-off decisions,” he says. “To make the date, sometimes we take shortcuts, or skip writing our automated tests, or hard-code something for a very specific case, knowing that it won’t work in the long-term. Sometimes we tolerate daily workarounds, like manually creating an environment or manually performing a deployment. We make a grave mistake when we don’t realize how much this impacts our future productivity.”
Erik looks around the table, pleased that everyone is listening intently to his every word.
“All the tech giants, at some point in their history, have used the feature freeze to massively rearchitect their systems. Consider Microsoft in the early 2000s—that was when computer worms were routinely taking down the internet, most famously CodeRed, Nimda, and of course SQL Slammer, which infected and crashed nearly 100,000 servers around the world in less than ten minutes. CEO Bill Gates was so concerned that he wrote a famous internal memo to every employee, stating that if a developer has to choose between implementing a feature or improving security, they must choose security, because nothing less than the survival of the company was at stake. And thus began the famous security stand-down that affected every product at Microsoft. Interestingly, Satya Nadella, CEO of Microsoft, still has a culture that if a developer ever has a choice between working on a feature or developer productivity, they should always choose developer productivity.
“Back to 2002—that same year, Amazon CEO Jeff Bezos wrote his famous memo to all technologists, stating that they must rearchitect their systems so that all data and functionality are provided through services. Their initial focus was their OBIDOS system, originally written in 1996, which held almost all the business logic, display logic, and functionality that made Amazon.com so famous.
“But over time, it became too complected for teams to be able to work independently. Amazon likely spent over $1 billion over six years rearchitecting all their internal services to be decoupled from each other. The result was astonishing. By 2013 they were performing nearly 136,000 deployments per day. Interesting that these CEOs I mention all have a software background, isn’t it?
“Contrast that with the tragic story of Nokia. When their market was disrupted by Apple and Android, they spent hundreds of millions of dollars hiring developers and investing in rolling out Agile. But they did so without realizing their real problem: technical debt in the form of an architecture where developers could not be productive. They lacked the conviction to rebuild the foundations of their software systems. Just like at Amazon in 2002, every software team at Nokia was unable to build what they needed to because they were hamstrung by the Symbian platform.
“In 2010, Risto Siilasmaa was a board director at Nokia. When he learned that generating a Symbian build took a whole forty-eight hours, he said that it felt like someone hit him in the head with a sledgehammer,” Erik says. “He knew that if it took two days for anyone to determine whether a change worked or would have to be redone, there was a fundamental and fatal flaw in their architecture that doomed their near-term profitability and long-term viability. They could have had twenty times more developers, and it wouldn’t have made them go any faster.”
Erik pauses. “It’s incredible. Sensei Siilasmaa knew that all the hopes and promises made by the engineering organization were a mirage. Even though there were numerous internal efforts to migrate off of Symbian, they were always shot down by the top executives until it was too late.
“Business people can see features or apps, so getting funding for those is easy,” he continues. “But they don’t see the vast architectures underneath that support them, connecting systems, teams, and data to each other. And underneath that is something extraordinarily important: the systems that developers use in their daily work to be productive.
“It’s funny: the tech giants assign their very best engineers to that bottom layer, so that every developer can benefit. But at Parts Unlimited, the very best engineers work on features at that top layer, with no one besides interns on the bottom working on Dev productivity.”
Erik continues, “So your mission is clear. Everyone has been told to pay down technical debt, which will help you realize the First Ideal of Locality and Simplicity and the Second Ideal of Focus, Flow, and Joy. But almost certainly, you will have to master the Third Ideal of Improvement of Daily Work.” Then he gets up and leaves as quickly as he joined them.
Everyone watches him leave. Then Kirsten says, “Is he coming back?”
Cranky Dave throws his hands in the air. “What happened at Nokia is happening here. Two years ago, we could implement a significant feature in two to four weeks. And we delivered a ton of great stuff. I remember those days! If you had a great idea, we could get it done.
“But now? That same class of feature takes twenty to forty weeks. Ten times longer! No wonder everyone’s so pissed off at us,” Cranky Dave yells. “We’ve hired more engineers, but it feels like we’re getting less and less done. And not only are we slower, those changes are incredibly dangerous to make.”
“This makes sense,” Kirsten says. “By almost any measure, productivity is flat or down. Feature due date performance is way down. I did some research since our last meeting—I asked my project managers to sample a couple of features and find out how many teams were required to implement them. The average number of teams required was 4.2, which is shocking. Then they told me that many had to interact with over eight teams,” she says. “We’ve never formally tracked this, but most of my people say that these numbers are definitely higher than they were two years ago.”
Maxine’s jaw drops. Absolutely no one can get anything done if they have to work with eight other teams all the time, she realizes. Just like the extended warranty feature she started working on with Tom.
“Well, Project Inversion is our shot to fix some of these things and to engineer our way out of this,” Kurt says. “Shannon will find out what the Phoenix teams need help on. How about us? If someone gave us the authority, and we were given infinite resources for one month, what would we do?”
Maxine smiles as she hears the suggestions fly fast and furious. They start making a list: Every developer uses a common build environment. Every developer is supported by a continuous build and integration system. Everyone can run their code in production-like environments. Automated test suites are built to replace manual testing, liberating QA people to do higher value work. Architecture is decoupled to liberate feature teams, so developers can deliver value independently. All the data that teams need is put in easily consumed APIs …
Shannon looks over the list they’ve generated, smiling. “I’ll post the updated list when I’m done interviewing the teams tomorrow. This is exciting,” she says. “This is what the developers want, even if they can’t articulate it. And that’s something I can help them with!”
It’s a great list, Maxine thinks. Everyone’s enthusiasm is evident.
“That is indeed a great list, Shannon, which could dramatically change the dynamics of how engineers work,” Erik says, sitting down next to Kirsten once again. Maxine looks around, wondering where he came from. Gesturing at Kirsten, he continues, “But consider the forces arrayed against you. The entire Project Management Office aims to keep projects on-time and on-budget, following the rules and enforcing the promises written long ago. Look at how Chris’ direct reports act—despite Project Inversion, they keep working on the features because they’re afraid of slipping their dates.
“Why? A century ago, when mass production revolutionized industry, the role of the leader was to design and decompose the work and to verify that it was performed correctly by armies of interchangeable workers, who were paid to use their hands, not their heads. Work was atomized, standardized, and optimized. And workers had little ability to improve the system they worked within.
“Which is strange, isn’t it?” Erik muses. “Innovation and learning occur at the edges, not the core. Problems must be solved on the front-lines, where daily work is performed by the world’s foremost experts who confront those problems most often.
“And that’s why the Third Ideal is Improvement of Daily Work. It is the dynamic that allows us to change and improve how we work, informed by learning. As Sensei Dr. Steven Spear said, ‘It is ignorance that is the mother of all problems, and the only thing that can overcome it is learning.’
“The most studied example of a learning organization is Toyota,” he continues. “The famous Andon cord is just one of their many tools that enable learning. When anyone encounters a problem, everyone is expected to ask for help at any time, even if it means stopping the entire assembly line. And they are thanked for doing so, because it is an opportunity to improve daily work.
“And thus problems are quickly seen, swarmed, and solved, and then those learnings are spread far and wide, so all may benefit,” he says. “This is what enables innovation, excellence, and outlearning the competition.
“The opposite of the Third Ideal is someone who values process compliance and TWWADI,” he says with a big smile. “You know, ‘The Way We’ve Always Done It.’ It’s the huge library of rules and regulations, processes and procedures, approvals and stage gates, with new rules being added all the time to prevent the latest disaster from happening again.
“You may recognize them as rigid project plans, inflexible procurement processes, powerful architecture review boards, infrequent release schedules, lengthy approval processes, strict separation of duties …
“Each adds to the coordination cost for everything we do, and drives up our cost of delay. And because the distance between where decisions are made and where work is performed keeps growing, the quality of our outcomes diminishes. As Sensei W. Edwards Deming once observed, ‘a bad system will beat a good person every time.’
“You may have to change old rules that no longer apply, change how you organize your people and architect your systems,” he continues. “For the leader, it no longer means directing and controlling, but guiding, enabling, and removing obstacles. General Stanley McChrystal massively decentralized decision-making authority in the Joint Special Operations Task Force to finally defeat Al Qaeda in Iraq, their much smaller but nimbler adversary. There the cost of delay was not measured in money, but in human lives and the safety of the citizens they were tasked to protect.
“That’s not servant leadership, it’s transformational leadership,” Erik says. “It requires understanding the vision of the organization, the intellectual stimulation to question the basic assumptions of how work is performed, inspirational communication, personal recognition, and supportive leadership.
“Some think it’s about leaders being nice,” Erik guffaws. “Nonsense. It’s about excellence, the ruthless pursuit of perfection, the urgency to achieve the mission, a constant dissatisfaction with the status quo, and a zeal for helping those the organization serves.
“Which brings us to the Fourth Ideal of Psychological Safety. No one will take risks, experiment, or innovate in a culture of fear, where people are afraid to tell the boss bad news,” Erik says, laughing. “In those organizations, novelty is discouraged, and when problems occur, they ask ‘Who caused the problem?’ They name, blame, and shame that person. They create new rules, more approvals, more training, and, if necessary, rid themselves of the ‘bad apple,’ fooling themselves that they’ve solved the problem,” he says.
“The Fourth Ideal asserts that we need psychological safety, where it is safe for anyone to talk about problems. Researchers at Google spent years on Project Aristotle and found that psychological safety was one of the most important factors of great teams: where there was confidence that the team would not embarrass, reject, or punish someone for speaking up.
“When something goes wrong, we ask ‘what caused the problem,’ not ‘who.’ We commit to doing what it takes to make tomorrow better than today. As Sensei John Allspaw says, every incident is a learning opportunity, an unplanned investment that was made without our consent.
“Picture this scenario: You are in an organization where everyone is making decisions, solving important problems every day, and teaching others what they’ve learned,” Erik says. “Your adversary is an organization where only the top leaders make decisions. Who will win? Your victory is inevitable.
“It’s so easy for leaders to talk about the platitudes of creating psychological safety, empowering and giving a voice to the front-line worker,” he says. “But repeating platitudes isn’t enough. The leader must constantly model and coach and positively reinforce these desired behaviors every day. Psychological safety slips away so easily, like when the leader micromanages, can’t say ‘I don’t know,’ or acts like a know-it-all, pompous jackass. And it’s not just leaders, it’s also how one’s peers behave.”
A bartender walks up to Erik and whispers something in his ear. Erik mutters, “Again?” He looks up and says, “I’ll be right back. Something requires my attention,” and walks away with the bartender.
They stare at Erik walking away. Dwayne eventually says, “He’s so right about the Third and Fourth Ideal. What can we do about the culture of fear that’s all around us? Look at what happened to Chad. He tried to do the right thing and got fired. I probably have more reasons to dislike Chad than any of you—those rolling network outages during the day drove me crazy. But firing Chad doesn’t do a damned thing to make those outages less likely in the future.
“I did some asking around to find out what actually happened,” Dwayne continues. “Apparently, Chad had worked four nights in a row, in addition to working his normal daytime hours, to support the store modernization initiative. When I asked why, he told me he didn’t want the store teams to get dinged on their status reports because of him.”
Kirsten raises an eyebrow. Dwayne continues, “His manager kept badgering him to go home, and he finally went home on time on Wednesday. But he was back online at midnight because he didn’t want to let the store launch team down. He was so worried about all the work piling up, in tickets and in the chat rooms, that he wasn’t sleeping through the night anymore.
“So he comes into work early on Thursday morning, still tired from all those late nights, and he takes on an urgent internal networking change that needed to be made,” he says. “He opens his laptop, and there’s like thirty terminal windows open from all the things he’s working on. He types a command into the terminal window and hits enter. And it turns out, he typed it into the wrong window.
“Blam! Most of the Tier 2 business systems become inaccessible, including Data Hub,” he says. “The next day, he’s fired. Does that seem right to you? Does that seem fair and just?”
“Oh, my God,” Maxine blurts out, horrified. She knows exactly how this feels. She’s done it several times in her career. You type something, hit enter, and immediately realize you’ve made a huge mistake, but it’s too late. She’s accidentally deleted a customer database table thinking it was the test database. She’s accidentally rebooted the wrong production server, taking down an order entry system for an afternoon. She’s deleted wrong directories, shut down wrong server clusters, and disabled the wrong login credentials.
Each time, it felt like her blood turned to ice, followed by panic. Once, earlier in her career, when she accidentally deleted the production source control repository, she literally wanted to crawl under her desk. Because of the OS it ran on, she knew no one would ever know it was her. But despite being afraid to tell anyone about it, she told her manager. It was one of the scariest things she had done as a young engineer.
“That really, really sucks, Dwayne,” says Brent. “That could have been me … Seriously, every week I’m in situations where I could have made that same mistake.”
Maxine says, “It could have been any of us. Our systems are so tightly coupled around here, even small changes can have a catastrophic impact. And worse, Chad couldn’t ask for help when he obviously needed it. No one can sustain those insanely long working hours. Who wouldn’t start making mistakes if you can’t even sleep anymore?”
“Yes!” Dwayne exclaims. “How did we get into this position where someone is so overworked that they’re working four nights in a row? What sort of expectations are being set when someone can’t take a day off when they need to? And what sort of message are we sending when the reward for caring so much is that we fire you?”