The Phoenix Project
It’s not exactly like mission control in Apollo 13, but that’s how I explain it to my relatives.
When something hits the fan, you need all the various stakeholders and technology managers to communicate and coordinate until the problem is resolved. Like now. At the conference table, fifteen people are in the midst of a loud and heated discussion, huddled around one of the classic gray speakerphones that resembles a UFO.
Wes and Patty are sitting next to each other at the conference table, so I walk behind them to listen in. Wes leans back in his chair with his arms crossed over his stomach. They don’t get all the way across. At six feet three inches tall and over 250 pounds, he casts a shadow on most people. He seems to be in constant motion and has a reputation of saying whatever is on his mind.
Patty is the complete opposite. Where Wes is loud, outspoken, and shoots from the hip, Patty is thoughtful, analytical, and a stickler for processes and procedures. Where Wes is large, combative, and sometimes even quarrelsome, Patty is elfin, logical, and levelheaded. She has a reputation for loving processes more than people and is often in the position of trying to impose order on the chaos in IT.
She’s the face of the entire IT organization. When things go wrong in IT, people call Patty. She’s our professional apologist, whether it’s services crashing, web pages taking too long to load, or, as in today’s case, missing or corrupted data.
They also call Patty when they need their work done—like upgrading a computer, changing a phone number, or deploying a new application. She does all of the scheduling, so people are always lobbying her to get their work done first. She’ll then hand it off to people who do the work. For the most part, they live in either my old group or in Wes’ group.
Wes pounds the table, saying, “Just get the vendor on the phone and tell them that unless they get a tech down here pronto, we’re going to the competition. We’re one of their largest customers! We should probably have abandoned that pile of crap by now, come to think of it.”
He looks around and jokes, “You know the saying, right? The way you can tell a vendor is lying is when their lips are moving.”
One of the engineers across from Wes says, “We have them on the phone right now. They say it’ll be at least four hours before their SAN field engineer is on-site.”
I frown. Why are they talking about the SAN? Storage area networks provide centralized storage to many of our most critical systems, so failures are typically global: It won’t be just one server that goes down; it’ll be hundreds of servers that go down all at once.
While Wes starts arguing with the engineer, I try to think. Nothing about this payroll run failure sounds like a SAN issue. Ann suggested that it was probably something in the timekeeping applications supporting each plant.
“But after we tried to roll back the SAN, it stopped serving data entirely,” another engineer says. “Then the display started displaying everything in kanji! Well, we think it was kanji. Whatever it was, we couldn’t make heads or tails of those little pictures. That’s when we knew we needed to get the vendor involved.”
Although I’m joining late, I’m convinced we’re totally on the wrong track.
I lean in to whisper to Wes and Patty, “Can I get a minute with you guys in private?”
Wes turns and, without giving me his full attention, says loudly, “Can’t it wait? In case you haven’t noticed, we’re in the middle of a huge issue here.”
I put my hand firmly on his shoulder. “Wes, this is really important. It’s about the payroll failure and concerns a conversation I just had with Steve Masters and Dick Landry.”
He looks surprised. Patty is already out of her chair. “Let’s use my office,” she says, leading the way.
Following Patty into her office, I see a photo on her wall of her daughter, who I’d guess is eleven years old. I’m amazed at how much she looks like Patty—fearless, incredibly smart, and formidable—in a way that is a bit scary in such a cute little girl.
In a gruff voice, Wes says, “Okay, Bill, what’s so important that you think is worth interrupting a Sev 1 outage in progress?”
That’s not a bad question. Severity 1 outages are serious business-impacting incidents that are so disruptive, we typically drop everything to resolve them. I take a deep breath. “I don’t know if you’ve heard, but Luke and Damon are no longer with the company. The official word is that they’ve decided to take some time off. More than that, I don’t know.”
The surprised expressions on their faces confirm my suspicions. They didn’t know. I quickly relate the events of the morning. Patty shakes her head, uttering a tsk-tsk in disapproval.
Wes looks angry. He worked with Damon for many years. His face reddens. “So now we’re supposed to take orders from you? Look, no offense, pal, but aren’t you a little out of your league? You’ve managed the midrange systems, which are basically antiques, for years. You created a nice little cushy job for yourself up there. And you know what? You have absolutely no idea how to run modern distributed systems—to you, the 1990s is still the future!
“Quite frankly,” he says, “I think your head would explode if you had to live with the relentless pace and complexity of what I deal with every day.”
I exhale, while counting to three. “You want to talk to Steve about how you want my job? Be my guest. Let’s get the business what they need first and make sure that everyone gets paid on time.”
Patty responds quickly, “I know you weren’t asking me, but I agree that the payroll incident needs to be our focus.” She pauses and then says, “I think Steve made a good choice. Congratulations, Bill. When can we talk about a bigger budget?”
I flash her a small smile and a nod of thanks, returning my gaze to Wes.
A couple moments go by, and expressions I can’t quite decipher cross his face. Finally he relents, “Yeah, fine. And I will take you up on your offer to talk to Steve. He’s got a lot of explaining to do.”
I nod. Thinking about my own experience with Steve, I genuinely wish Wes luck if he actually decides to have a showdown with him.
“Thank you for your support, guys. I appreciate it. Now, what do we know about the failure—or failures? What’s all this about some SAN upgrade yesterday? Are they related?”
“We don’t know,” Wes shakes his head. “We were trying to figure that out when you walked in. We were in the middle of a SAN firmware upgrade yesterday when the payroll run failed. Brent thought the SAN was corrupting data, so he suggested we back out the changes. It made sense to me, but as you know, they ended up bricking it.”
Up until now, I’ve only heard “bricking” something in reference to breaking something small, like when a cell phone update goes bad. Using it to refer to a million-dollar piece of equipment where all our irreplaceable corporate data are stored makes me feel physically ill.
Brent works for Wes. He’s always in the middle of the important projects that IT is working on. I’ve worked with him many times. He’s definitely a smart guy but can be intimidating because of how much he knows. What makes it worse is that he’s right most of the time.
“You heard them,” Wes says, gesturing toward the conference table where the outage meeting continues unabated. “The SAN won’t boot, won’t serve data, and our guys can’t even read any of the error messages on the display because they’re in some weird language. Now we’ve got a bunch of databases down, including, of course, payroll.”
“To work the SAN issue, we had to pull Brent off of a Phoenix job we promised to get done for Sarah,” Patty says ominously. “There’s going to be hell to pay.”
“Uh-oh. What exactly did we promise her?” I ask, alarmed.
Sarah is the SVP of Retail Operations, and she also works for Steve. She has an uncanny knack for blaming other people for her screwups, especially IT people. For years, she’s been able to escape any sort of real accountability.
Although I’ve heard rumors that Steve is grooming her as his replacement, I’ve always discounted that as being totally impossible. I’m certain that Steve can’t be blind to her machinations.
“Sarah heard from someone that we were late getting a bunch of virtual machines over to Chris,” she replies. “We dropped everything to get on it. That is, until we had to drop everything to fix the SAN.”
Chris Allers, our VP of Application Development, is responsible for developing the applications and code that the business needs, which then get turned over to us to operate and maintain. Chris’ life is currently dominated by Phoenix.
I scratch my head. As a company, we’ve made a huge investment in virtualization. Although it looks uncannily like the mainframe operating environment from the 1960s, virtualization changed the game in Wes’ world. Suddenly, you don’t have to manage thousands of physical servers anymore. They’re now logical instances inside of one big-iron server or maybe even residing somewhere in the cloud.
Building a new server is now a right-click inside of an application. Cabling? It’s now a configuration setting. But despite the promise that virtualization was going to solve all our problems, here we are—still late in delivering a virtual machine to Chris.
“If we need Brent to work the SAN issue, keep him there. I’ll handle Sarah,” I say. “But if the payroll failure was caused by the SAN, why didn’t we see more widespread outages and failures?”
“Sarah is definitely going to be one unhappy camper. You know, suddenly I don’t want your job anymore,” Wes says with a loud laugh. “Don’t get yourself fired on your first day. They’ll probably come for me next!”
Wes pauses to think. “You know, you have a good point about the SAN. Brent is working the issue right now. Let’s go to his desk and see what he thinks.”
Patty and I both nod. It’s a good idea. We need to establish an accurate timeline of relevant events. And so far, we’re basing everything on hearsay.
That doesn’t work for solving crimes, and it definitely doesn’t work for solving outages.
CHAPTER 3
• Tuesday, September 2
I follow Patty and Wes as they walk past the NOC, into the sea of cubicles. We end up in a giant workspace created by combining six cubicles. A large table sits against one wall with a keyboard and four LCD monitors, like a Wall Street trading desk. There are piles of servers everywhere, all with blinking lights. Each portion of the desk is covered by more monitors, showing graphs, login windows, code editors, Word documents, and countless applications I don’t recognize.
Brent types away in a window, oblivious to everything around him. From his phone, I hear the NOC conference line. He obviously doesn’t seem worried that the loud speakerphone might bother his neighbors.
“Hey, Brent. You got a minute?” Wes asks loudly, putting a hand on his shoulder.
“Can it wait?” Brent replies without even looking up. “I’m actually kind of busy right now. Working the SAN issue, you know?”
Wes grabs a chair. “Yeah, that’s what we’re here to talk about.”
When Brent turns around, Wes continues, “Tell me again about last night. What made you conclude that the SAN upgrade caused the payroll run failure?”
Brent rolls his eyes, “I was helping one of the SAN engineers perform the firmware upgrade after everybody went home. It took way longer than we thought—nothing went according to the tech note. It got pretty hairy, but we finally finished around seven o’clock.
“We rebooted the SAN, but then all the self-tests started failing. We worked it for about fifteen minutes, trying to figure out what went wrong. That’s when we got the e-mails about the payroll run failing. That’s when I said, ‘Game Over.’
“We were just too many versions behind. The SAN vendor probably never tested the upgrade path we were going down. I called you, telling you I wanted to pull the plug. When you gave me the nod, we started the rollback.
“That’s when the SAN crashed,” he says, slumping in his chair. “It not only took down payroll but a bunch of other servers, too.”
“We’ve been meaning to upgrade the SAN firmware for years, but we never got around to it,” Wes explains, turning to me. “We came close once, but then we couldn’t get a big enough maintenance window. Performance has been getting worse and worse, to the point where a bunch of critical apps were being impacted. So finally, last night, we decided to just bite the bullet and do the upgrade.”
I nod. Then, my phone rings.
It’s Ann, so I put her on speakerphone.
“As you suggested, we looked at the data we pulled from the payroll database yesterday. The last pay period was fine. But for this pay period, all the Social Security numbers for the factory hourlies are complete gibberish. And all their hours worked and wage fields are zeroes, too. No one has ever seen anything like this before.”
“Just one field is gibberish?” I ask, raising my eyebrows in surprise. “What do you mean by ‘gibberish’? What’s in the fields?”
She tries to describe what she’s seeing on her screen. “Well, they’re not numbers or letters. There’s some hearts and spades and some squiggly characters… And there’s a bunch of foreign characters with umlauts… And there are no spaces. Is that important?”
When Brent snickers as he hears Ann trying to read line noise aloud, I give him a stern glance. “I think we’ve got the picture,” I say. “This is a very important clue. Can you send the spreadsheet with the corrupted data to me?”
She agrees. “By the way, are a bunch of databases down now? That’s funny. They were up last night.”
Wes mutters something under his breath, silencing Brent before he can say anything.
“Umm, yes. We’re aware of the problem and we’re working it, too,” I deadpan.
When we hang up, I breathe a sigh of relief, taking a moment to thank whatever deity protects people who fight fires and fix outages.
“Only one field corrupted in the database? Come on, guys, that definitely doesn’t sound like a SAN failure,” I say. “Brent, what else was going on yesterday, besides the SAN upgrade, that could have caused the payroll run to fail?”
Brent slouches in his chair, spinning it around while he thinks. “Well, now that you mention it… A developer for the timekeeping application called me yesterday with a strange question about the database table structure. I was in the middle of working on that Phoenix test VM, so I gave him a really quick answer so I could get back to work. You don’t suppose he did something to break the app, do you?”
Wes turns quickly to the speakerphone dialed into the NOC conference call that has been on this whole time and unmutes the phone. “Hey, guys, it’s Wes here. I’m with Brent and Patty, as well as with our new boss, Bill Palmer. Steve Masters has put him in charge of all of IT Ops. So listen up, guys.”
My desire for an orderly announcement of my new role seems less and less likely.
Wes continues, “Does anyone know anything about a developer making any changes to the timekeeping application in the factories? Brent says he got a call from someone who asked about changing some database tables.”
From the speakerphone, a voice pipes up, “Yeah, I was helping someone who was having some connectivity issues with the plants. I’m pretty sure he was a developer maintaining the timekeeping app. He was installing some security application that John needed to get up and running this week. I think his name was Max. I still have his contact information around here somewhere… He said he was going on vacation today, which is why the work was so urgent.”
Now we’re getting somewhere.
A developer jamming in an urgent change so he could go on vacation—possibly as part of some urgent project being driven by John Pesche, our Chief Information Security Officer.
Situations like this only reinforce my deep suspicion of developers: They’re often carelessly breaking things and then disappearing, leaving Operations to clean up the mess.
The only thing more dangerous than a developer is a developer conspiring with Security. The two working together gives us means, motive, and opportunity.
I’m guessing our CISO probably strong-armed a Development manager to do something, which resulted in a developer doing something else, which broke the payroll run.
Information Security is always flashing their badges at people and making urgent demands, regardless of the consequences to the rest of the organization, which is why we don’t invite them to many meetings. The best way to make sure something doesn’t get done is to have them in the room.
They’re always coming up with a million reasons why anything we do will create a security hole that alien space-hackers will exploit to pillage our entire organization and steal all our code, intellectual property, credit card numbers, and pictures of our loved ones. These are potentially valid risks, but I often can’t connect the dots between their shrill, hysterical, and self-righteous demands and actually improving the defensibility of our environment.
“Okay, guys,” I say decisively. “The payroll run failure is like a crime scene and we’re Scotland Yard. The SAN is no longer a suspect, but unfortunately, we’ve accidentally maimed it during our investigation. Brent, you keep working on the injured SAN—obviously, we’ve got to get it up and running soon.
“Wes and Patty, our new persons of interest are Max and his manager,” I say. “Do whatever it takes to find them, detain them, and figure out what they did. I don’t care if Max is on vacation. I’m guessing he probably messed up something, and we need to fix it by 3 p.m.”
I think for a moment. “I’m going to find John. Either of you want to join me?”
Wes and Patty argue over who will help interrogate John. Patty says adamantly, “It should be me. I’ve been trying to keep John’s people in line for years. They never follow our process, and it always causes problems. I’d love to see Steve and Dick rake him over the coals for pulling a stunt like this.”