The Unicorn Project
Everything is working as designed. She smiles as she hears a smattering of applause throughout the room but continues to stare at the graph.
She frowns. The number of completed orders graph has flatlined, stuck at 250. She looks at the other graphs to see if they’re stuck too, but they’re still climbing. Maxine sees a bunch of people crowded around the TV, pointing at the stuck graph.
Something is definitely going wrong.
“Let’s have some quiet here!” Wes hollers out. He remains silent for a couple of moments before he turns around and finally says, “I need people to try ordering products on the web and on mobile and tell me what is actually happening! Something is preventing orders from going through!” Maxine already has the app open on her phone. She hits the Add to Shopping Cart button and blinks in surprise. She calls out, “Mobile app crashes when you add an item to the cart on an iPhone … app crashes and disappears.”
“Dammit,” she hears someone say from the other side of the room. Someone else calls out, “Getting error message on Android. I see a dialog box that says ‘An error has occurred.’”
Right next to her, Shannon hollers out, “Web shopping cart is generating an error—web page renders after you hit submit, but I’m getting a blank webpage! I think something on the back end is erroring out when we query whether items are available to ship.”
Wes says from the front of the room, “Thank you, Shannon. Get all these screenshots into the #launch channel. Okay, everyone, listen up! We’re getting errors on all client platforms—Shannon thinks it’s one of the back-end calls we make: maybe the ‘available to promise’ API call or ‘available to ship.’ Anyone have any ideas?”
Maxine jumps into action, appreciating how great it is that Wes is running the war room. Yeah, he’s cantankerous, she thinks, but he’s handled more outages than everyone else in the room combined. Having that type of experience during this high-stakes launch is a very good thing. We developers are great at what we do, but these types of crises are a part of everyday life for Ops.
It doesn’t take long to confirm that Shannon’s hypothesis is correct—it was a problem in the order entry back-end systems. All the systems in that particular cluster are pegged at one hundred percent CPU usage; unfortunately, the system being hit is part of the main ERP, which handles almost all the core financials of the company. It’s been running for over thirty years, but it’s stuck on a version that is almost fifteen years old. It’s been so customized that it’s been impossible to upgrade. At least it’s put on newer hardware every five years. But there’s no easy way to throw more CPU cores at it to speed it up.
Apparently, even the small one percent promotion is causing it to get backed up. Maxine sees that queries are taking longer and longer to return, and client requests start timing out. All those clients start resending the queries, causing even more requests to overload the back-end database.
“Thundering herd problem,” Wes mutters, referring to when simultaneous client retries end up killing a server. “We can’t do anything on the back end. How do we get all the clients to back off on the retries?”
“We can’t change the mobile apps, but we can get the e-commerce servers to wait longer before they retry,” Brent says. Wes points at Brent and Maxine and says, “Do it.”
Maxine and Brent work with the e-commerce teams to push out new configuration files to every web server. They are able to push out all these changes into production in less than ten minutes.
Luckily, this is enough to stave off disaster. Maxine watches in relief as the database error rates start decreasing and the number of completed orders starts creeping up again. Several other things go wrong over the next two hours, but none of them are as heart-stopping as the ‘available to promise’ server issue that she and Brent had to deal with.
Another forty-five minutes later, they cross over their goal of three thousand completed orders, grossing a quarter million dollars in revenue, and the orders are still coming in strong. Maggie must have snuck out, because two hours later Maxine sees her come back into the room with a bunch of people carrying champagne bottles. Maggie opens one up and starts pouring glasses, handing the first to Maxine.
After everyone has a glass in hand, Maggie raises hers with a big smile. “Holy cow, everyone. What a day! And what an amazing team effort! I want to share with you some of the early results, and, wow, they are great … people are continuing to respond to the promotions, but at this point, almost a third of the people responded to our campaign. This is, without a doubt, the highest conversion rate we’ve ever achieved by at least a factor of five!”
She pulls out her phone and peers at the screen. “Here are some early calculations from the team. Over twenty percent of people who received our offer went to view the products we recommended, and over six percent purchased. We’ve never seen any numbers like this! So thank you to everyone here who helped make this happen.
“And remember, almost all the items we promoted are high-margin items or were sitting on shelves gathering dust. So each sale we made today will have an unusually large effect on profits!” Maggie cheers and drains her glass. Everyone laughs and follows suit.
She says, “Based on these results, the Unicorn promotion campaign to our entire customer base on Black Friday is a GO! If the results are anywhere near what we saw in this test campaign, we are going to have a blowout holiday season …
“Uh, just a reminder, this is insider information. If you use this information to trade Parts Unlimited stock, you can go to prison. Dick Landry, our CFO, told me to tell you that he will assist in your prosecution as per your employment contract,” she says, and then smiles. “But having said that, there’s no doubt we’re going to crush it on Black Friday!”
Everyone cheers loudly again, including Maxine. Maggie motions for everyone to quiet down and asks Kurt and Maxine to speak. Maxine laughs, motioning Kurt to go. He says, “What an amazing effort, everyone! I’m so proud. Maxine?”
Maxine hadn’t wanted to say anything, but being cornered like this, she stands up and raises her glass. “Here’s to the Rebellion showing the ancient, powerful order how kickass engineering work is done!”
Again, everyone cheers and laughs. When that all dies down, Maxine says, “Okay, enough of that. On Black Friday we can safely expect one hundred times the load as today. We’ll surely run into tons of problems we’ve never encountered before, so we’ve got our work cut out for us between now and then. Let’s figure out how to best prepare for it.”
Kurt adds, “I’d like to send as many people home on time tomorrow as possible, given that it’s Thanksgiving on Thursday. So let’s get to work! And we’ll need people in the office early on Friday to support the launch.”
They agree to stagger the emails and mobile app notifications to prevent the systems from being slammed all at once and to better protect those unexpectedly delicate back-end servers. Brent comes up with an idea to reconfigure the load balancers to rate-limit the transactions. This will cause customer errors on the mobile app and e-commerce servers, but everyone agrees this is far better than those back-end systems crashing again.
“We’ll get on it. I think we’ll be in good shape and get everyone out of here in time for Thanksgiving!” Brent says with a big smile. “Happy Thanksgiving, everyone!”
As Brent predicted, all the work is done before five the next day. With just a couple of exceptions, people start heading out. Maxine is making the rounds, trying to shoo the stragglers home. It’s the day before Thanksgiving, and Maxine wants to get out of here by five thirty. She’s proud that she even got Brent to leave.
One team that couldn’t leave were the data analysts. Now that the one percent test proved to be a smashing success, they had to finish generating all their recommendations for millions of customers by Friday. The resulting compute loads on Panther keep growing, and they keep updating the promotions data in the Narwhal data platform. Maxine thinks with a grin, We’re racking up a heck of a bill with the cloud computing providers, but absolutely no one in Marketing is complaining because the business benefits are so spectacular.
She swings in to say goodbye to Kurt but freezes mid-stride when she sees Sarah having a heated discussion with him.
“… and I walk around this building after five and there’s barely anyone here. Kurt, I don’t know if you realize this, but the company is on the verge of extinction. We need everyone pulling their weight,” Sarah says, fuming in righteous indignation. “I think we need some mandatory overtime. Buy them more pizza and they’ll be happy to stay and do the work.
“And if that weren’t bad enough,” she continues, “I just saw a bunch of people sitting around reading books! We don’t pay people to read books; we pay them to do work. That should be pretty clear, right, Kurt?” Kurt’s expression remains deadpan.
“You’ll have to bring that up with Chris. Banning books from the workplace is above my paygrade.” She gives him a dirty look and storms out.
Kurt makes a motion to Maxine, indicating that he wants to hang himself. “It’s so strange,” he says. “She thinks we pay developers just to type, instead of paying them to think and achieve business outcomes. And that means we pay them to learn, because that’s how we win. Can you imagine banning books from the workplace?” he says, laughing and shaking his head.
Maxine just stares at Kurt. Sarah’s beliefs are like the antithesis of the Third Ideal of Improvement of Daily Work and the Fourth Ideal of Psychological Safety. Maxine knows that the only way they could have achieved what they had was by creating a culture where people felt safe to experiment, to learn and make mistakes, and where people make time for discovery, innovation, and learning.
“No argument from me, Kurt. Let me know if you convince her,” Maxine says smiling, waving goodbye. “Happy Thanksgiving.”
Maxine has a fantastic Thanksgiving. It’s the first since her father died, and she enjoys having everyone over, even if she is surreptitiously looking at her phone all the time to see how the Black Friday preparations are going.
The highlight of Thanksgiving is when Waffles, now not so little, tipping the scale at forty pounds, grabs a big piece of turkey off the table in front of everyone, to Maxine’s horror. It was the first time he had ever done that, Jake promises everyone.
Everyone pitches in to clean up after, and Maxine goes to bed early.
She needs to be in the office early the next morning.
At three thirty a.m. she’s in the office with the rest of the team. The technical teams had been going through their launch checklist, getting ready for the surge in demand that would start in a couple of hours. They grab another conference room for the extended teams who can’t fit in the first. It’s a larger affair than the one percent test they ran on Tuesday. Each conference room has a similar configuration of big, U-shaped tables with about thirty people seated. She starts her day in the room where the technical teams are assembled.
In the extended war room are the Narwhal and Orca teams, next to the monitoring team, the web front-end teams, the mobile teams, and the numerous back-end service teams responsible for products, pricing, ordering, and fulfillment. There are many more technical teams on standby in the chatroom.
All of these services have to run seamlessly for products to be presented to a customer and for orders to be placed. On the huge TV monitor on the wall are more technical graphs showing the number of visits to the website, stats on the top product pages, as well as health checks and most recent errors from all the services represented in the room.
In the primary war room, they’ve set up a second TV where some of these technical metrics are displayed. And today, they have more representatives from business and technology leadership, the entire Unicorn and Promotions team, and even people from Finance and Accounting. Everyone who matters is here to see how the campaign goes.
At four thirty a.m. Maxine is hanging out with Kurt and Maggie in the primary war room. She is looking for something to help with, but everyone seems to know what they’re doing. At this point, all she can do is get in the way. They are thirty minutes away from the beginning of the campaign launch.
Sarah is here too. As far as Maxine can tell, she appears to be haranguing someone about the pricing and promotional copy for one of the offers.
Maggie is also in the huddle, not looking happy, saying, “Look, I know we want the offers to be perfect, but the time to make changes was yesterday. The risk of making changes in the copy is just too high for something going out to so many people. It might delay the launch by another hour.”
“This may be good enough for you, but it’s certainly not good enough for me. Get this fixed. Now,” Sarah says, eliminating any further discussion.
Maggie sighs and walks away, rejoining Kurt and Maxine. “We’re going to have to make some changes,” she says, rolling her eyes. “Undoubtedly, this is going to push back the launch by at least an hour.”
“I’ll go tell the technical teams next door,” Kurt says, grimacing as he leaves the room.
An hour later, things are again finally ready to go. Maggie asks from the front of the room, “If there are no objections, let’s launch at six a.m. That’s fifteen minutes from now.”
When the launch begins, Maxine is in the business war room watching the large TV monitor like everyone else. Within two minutes, over ten thousand people have hit the website and are going through the order funnel, and the rates of arrivals keep climbing. And again, all the CPU loads start climbing, much higher than in the test launch.
People clap as the number of completed orders passes five hundred. Maxine is amazed at the scale of customers who are being mobilized by this launch.
She holds her breath, hoping that all their hard work hardening their systems will make this launch boring. She watches as the number of orders continues to climb … until they flatline, just like on Tuesday.
“Dammit, dammit,” Maxine mutters. Something is definitely going wrong again. And in the same portion of the order funnel. Something is preventing people from completing the shopping cart checkout.
Wes hollers out, “Someone tell me what’s going wrong with the shopping cart! Who has any relevant data or error messages?”
Shannon is the first to speak up again. Maxine marvels at Shannon’s uncanny ability to be first on scene. “Web shopping cart is generating an error. Fulfillment options aren’t being shown! I’m guessing some fulfillment service is failing. Posting screenshot in the chatroom.”
Someone from the other side of the room hollers out, “iOS mobile app crashing again.” Wes swears. The mobile app Dev manager swears.
Maxine tunes everything out, because in that moment, she’s suddenly afraid that maybe Data Hub is causing the problem. She’s still trying to think this through when she hears someone from the mobile team holler out: “Wes! The app just crashed after I hit the checkout button, right when it should have presented all the transaction details. I think a call to a back-end service is timing out. I thought we fixed all the places where that can happen, but we obviously missed one. We’re trying to figure out which service call is causing the problem.”
“Could that be a call to Data Hub?” Maxine whispers to Tom.
“Not sure,” Tom says, thinking. “I don’t think there’s any direct calls from the mobile app to us …”
On her laptop, Maxine pulls up the logs from the production Data Hub service, looking for anything unusual, grateful that she can do this herself now. She sees a couple of incoming order events, which generate four outgoing calls to other business systems. They all appear to be succeeding.
Seeing nothing, she turns her attention back to the front of the noisy room where Wes, Kurt, and Chris are convening. Seeing that they’re actively in discussion, Maxine joins them. She hears Wes ask, “… so what service is failing?”
Chris and Kurt pow-wow for a bit, and Wes apparently loses patience. He turns to the entire room and hollers over all the commotion, “Listen up, everyone! Something in the transaction path between bringing up the shopping cart and completing an order is failing. Maxine, what are the names of each of these transactions and service calls?”
Although she is surprised at being prompted, she quickly rattles off eleven API calls and services off the top of her head. Brent calls out three more. “Thank you, Maxine and Brent,” Wes says.
Turning to the room, he hollers, “Okay, everyone, prove to me that each one of those services are working!”
Minutes later, they discover the problem. When a customer views the shopping cart, they are presented with the order details, payment options, and shipping options. When all that is correct, the customer hits the place order button.
Apparently, when displaying this page in the mobile app and on the web, a call is made to a back-end service to determine which shipping options are available based on their location, such as next-day air and ground shipment, as well as providers such as UPS and FedEx.
This service calls out to a bunch of external APIs from the shipping providers, and some of those are failing. Brent suspects that they are being rate-limited by one of them, because they’d never had Parts Unlimited servers send so many queries like this before.
Maxine can’t believe that a service that seems so trivial is jeopardizing the entire launch. She smiles and makes a note of this, because she knows that this will likely be the new normal. But for something this mission-critical, there’s no way we should depend on external services, she thinks. We need to gracefully handle the case when they’re down or when they cut us off.
Maxine joins the technology team leaders huddling in the front of the room. She suggests, “When we get shipping API failures, maybe we present just the ground shipment option. We know that this type of shipping is always available … Thoughts?”
The fulfillment service team lead nods, and they quickly work through the details with Wes and Maggie. They decide that, effective immediately, if they can’t get information from all shippers, they’ll just present ground shipment as the only option.