
An Elegant Puzzle: Systems of Engineering Management


by Will Larson


  One potential exception is when you’re using a baseline as a contract with a second party, possibly specifying an SLO,16 at which point you’ll probably want to discuss it more explicitly than other baselines: missing an SLO will probably require immediate reprioritization, whereas missing most other baselines can generally be addressed more methodically.

  From OKRs17 onward, there are dozens of different approaches to setting metrics, but I’ve found this format to be a useful, lightweight structure to start from. If folks have found other approaches easier or more useful, I’d love to hear from you!

  3.5 Guiding broad organizational change with metrics

  Although people often talk about goals and metrics18 when they’re writing new plans or reflecting on past plans, my fondest memories of metrics are when I’ve seen them used to drive large organizational change. In particular, I’ve found metrics to be an extremely effective way to lead change with little or no organizational authority, and I wanted to write up how I’ve seen that work.

  At both Stripe and Uber, I’ve had the opportunity to manage infrastructure costs. (Let me insert a plug for Ryan Lopopolo’s amazing blog post on “Effectively Using AWS Reserved Instances.”19) Folks who haven’t thought about this problem often default to viewing it as boring, but I’ve found that as you dig into it, it’s rich soil for learning about leading organizational change.

  It’s also a good example of how to lead change with metrics!

  Infrastructure cost is a great example of a baseline metric.20 When you’re asked to take responsibility for a company’s overall infrastructure costs, you’re going to start from a goal along the lines of “Maintain infrastructure costs at their current 30 percent of net revenue.” (That percentage is a fictional number for this example’s purposes, since the percentage will depend on your industry and maturity, but I have found that tying it against net revenue is more useful than pinning it at a specific dollar amount.)

  From there, the approach that I’ve found effective is:

  Explore: The first step is to get data in an explorable format in your data warehouse, an SQL database, or even an Excel spreadsheet. Once there, spend time looking through it and getting a feel for it. Your goal in this phase is to identify where the levers for change are. For example, you might find that your batch pipeline is the majority of your costs and that your data warehouse is surprisingly cheap, which will allow you to focus further efforts.
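
  For a rough sense of what that exploration can look like in practice, here is a minimal sketch in Python; the file name and column names (date, component, usd_cost) are purely illustrative stand-ins for whatever your billing export actually contains.

    import pandas as pd

    # Hypothetical billing export with columns: date, component, usd_cost.
    costs = pd.read_csv("cost_export.csv", parse_dates=["date"])

    # Where does the money actually go? Rank components by total spend.
    by_component = (
        costs.groupby("component")["usd_cost"]
        .sum()
        .sort_values(ascending=False)
    )
    print(by_component.head(10))

    # How is the biggest component trending month over month?
    top = by_component.index[0]
    monthly = (
        costs[costs["component"] == top]
        .set_index("date")
        .resample("MS")["usd_cost"]
        .sum()
    )
    print(monthly)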

  Dive: Once you know the three or four major contributors, go deep on understanding those areas and the levers that drive them. Batch costs might be sensitive to the number of jobs, total data stored, or new product development, or they might be entirely driven by a couple of expensive jobs. Diving deep helps you build a mental model, and it also kicks off a relationship between you and the teams you’ll want to partner with most closely.

  Attribute: For most company-level metrics (cost, latency, development velocity, etc.), the first step of diving will uncover one team who are nominally accountable for the metric’s performance, but they are typically a cloak. When you pull that cloak aside, that team’s performance is actually driven by dozens of other teams. For example, you might have a Cloud engineering team who are accountable for provisioning VMs, but they’re not the folks writing the code that runs on those VMs. It’s easy to simply pass the cost metric on to that Cloud team, but that’s just abdicating responsibility to them. What’s more useful is to help them build a system of second-degree attribution, allowing you to build data around the teams using the platform. This second degree of attribution is going to allow you to target the folks who can make an impact in the next step.
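
  A minimal sketch of that second-degree attribution, assuming a hypothetical tag export that maps each VM to the team whose code actually runs on it (all file and column names here are illustrative):

    import pandas as pd

    vm_costs = pd.read_csv("vm_costs.csv")   # columns: vm_id, usd_cost
    ownership = pd.read_csv("vm_tags.csv")   # columns: vm_id, owning_team

    attributed = vm_costs.merge(ownership, on="vm_id", how="left")

    # Untagged VMs still land on the Cloud team, which keeps pressure on
    # tagging coverage without letting costs disappear entirely.
    attributed["owning_team"] = attributed["owning_team"].fillna("cloud-eng")

    by_team = (
        attributed.groupby("owning_team")["usd_cost"]
        .sum()
        .sort_values(ascending=False)
    )
    print(by_team)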

  Contextualize: Armed with the attribution data, start to build context around each team’s performance. The most general and self-managing tool for this is benchmarking. It’s one thing for a team to know that they’re spending $100,000 a month, and it’s something entirely different for them to know that they’re spending $100,000 a month and that their team spends the second-highest amount out of 47 teams. Benchmarking is particularly powerful because it automatically adapts to changes in behavior. In some cases, benchmarking against all teams might be too coarse, and it may be useful to benchmark against a small handful of cohorts. For example, you might want to define cohorts for front-end, back-end, and infrastructure teams, given that they’ll have very different cost profiles.
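
  As a sketch of that benchmarking, assuming a hypothetical per-team cost table with a cohort column (front-end, back-end, infrastructure):

    import pandas as pd

    # Hypothetical columns: team, cohort, monthly_usd.
    team_costs = pd.read_csv("team_costs.csv")

    # Rank each team within its cohort: 1 = highest spender.
    team_costs["cohort_rank"] = (
        team_costs.groupby("cohort")["monthly_usd"]
        .rank(ascending=False, method="min")
        .astype(int)
    )
    team_costs["cohort_size"] = (
        team_costs.groupby("cohort")["team"].transform("count")
    )

    # "Second-highest of 47 teams" is far more actionable than "$100,000."
    for row in team_costs.itertuples():
        print(f"{row.team}: ${row.monthly_usd:,.0f}/mo, "
              f"#{row.cohort_rank} of {row.cohort_size} {row.cohort} teams")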

  Nudge: Once you’ve built context around the data so that folks can interpret it, the next step is to start nudging them to action! Dashboards are very powerful for analysis, but the challenge for baseline metrics is that folks shouldn’t need to think about them the vast majority of the time, and that can lead to them forgetting about the baselines entirely. What I’ve found effective is to send push notifications, typically email, to teams whose metric has changed recently, both in terms of absolute change and in terms of their benchmarked performance against their cohort. This ensures that each time you push information to a team, it includes important information that they should act on! What’s so powerful about nudges is that simply letting folks know their behavior has changed will typically stir them to action, and it doesn’t require any sort of organizational authority to do so. (For more on this topic, take a look at Nudge by Richard H. Thaler and Cass R. Sunstein.)21

  Baseline: In the best case, you’ll be able to drive the organizational impact you need with contextualized nudges, but in some cases that isn’t quite enough. The next step is to work with the key teams to agree on baseline metrics for their performance. This is useful because it ensures that the baselines are top-of-mind, and it also gives them a powerful tool for negotiating priorities with their stakeholders. In some cases, this does require some organizational authority, but I’ve found that folks universally want to be responsible. As long as you can find time to sit down with the key teams and explain why the goal is important, it typically doesn’t require much organizational authority.

  Review: The final phase, which hopefully you won’t need to reach, is running a monthly or quarterly review that looks at each team’s performance, and reaching out to teams to advocate for prioritization if they aren’t sustaining their agreed-upon baselines. This typically requires an executive sponsor, because teams who aren’t hitting their baselines are almost always being prioritized against other goals, and they need your help explaining to their stakeholders why the change is important.

  I’ve seen this approach work and, more importantly, I’ve found it to be very scalable. It enables a company to concurrently maintain many baseline metrics without overloading its teams. This is largely because this approach focuses on driving targeted change within the key drivers, only requiring involvement from a small subset of teams for any given metric. The approach is also effective because it tries to minimize top-down orchestration in favor of providing information to encourage teams themselves to adjust priorities.

  3.6 Migrations: the sole scalable fix to tech debt

  The most interesting migration I ever participated in was Uber’s migration from Puppet-managed services to a fully self-service provisioning model in which any engineer at the company could spin up a new service in two clicks. Not only could they, they did, provisioning multiple services each day by the time the self-service system was complete, and every newly hired engineer could spin up a service from scratch on their first day.

  Figure 3.5

  Stages of a technical migration.

  What made this migration so interesting was the volume. When we started, provisioning a new service took about two weeks of clock time and about two days of engineering time, and we were falling further behind each day. At the time, this was more than just a little stressful, but it was also a perfect laboratory to learn how to run large-scale software migrations: the transition was large enough to see even small shifts, and long enough that we got to experiment with a number of approaches.

  Migrations are both essential and frustratingly frequent as your codebase ages and your business grows: most tools and processes only support about one order of magnitude of growth22 before becoming ineffective, so rapid growth makes migrations a way of life. This isn’t because you have bad processes or poor tools—quite the opposite. The fact that something stops working at significantly increased scale is a sign that it was designed appropriately to the previous constraints rather than being over-designed.23

  As a result, you switch tools a lot, and your ability to migrate to new software can easily become the defining constraint on your overall velocity. Despite their importance, we don’t talk about running migrations very often; let’s remedy that!

  3.6.1 Why migrations matter

  Migrations matter because they are usually the only available avenue to make meaningful progress on technical debt.

  Engineers hate technical debt. If there is an easy project that they can personally do to reduce tech debt, they’ll take it on themselves. Engineering managers hate technical debt, too. If there is an easy project that their team can execute in isolation, they’ll get it scheduled. In aggregate, this leads to a dynamic in which there is very little low-hanging fruit to reduce technical debt, and most remaining options require many teams working together to implement them. The result: migrations.

  Each migration aims to create technical leverage (Your indexes no longer have to fit on a single server!) or reduce technical debt (Your acknowledged writes are guaranteed to persist through a master failover!). They occupy the awkward territory of reduced immediate contribution today in exchange for more capacity tomorrow. This makes migrations controversial to schedule, and as your systems become larger, they become more expensive. Lore tells us that Googlers have a phrase, “running to stand still,” to describe a team whose entire capacity is consumed in upgrading dependencies and patterns, such that the group can’t make forward progress on the product/system they own. Spending all your time on migrations is extreme, but every midsize company has a long queue of migrations that it can’t staff: moving from VMs to containers, rolling out circuit-breaking, moving to the new build tool . . . the list extends effortlessly into the sunset.

  Migrations are the only mechanism to effectively manage technical debt as your company and code grow. If you don’t get effective at software and system migrations, you’ll end up languishing in technical debt. (And you’ll still have to do one later anyway; it’s just that it’ll probably be a full rewrite.)

  3.6.2 Running good migrations

  The good news is that while migrations are hard, there is a pretty standard playbook that works remarkably well: de-risk, enable, then finish.

  • De-risk

  The first phase of a migration is de-risking it, as quickly and cheaply as possible. Write a design document and shop it with the teams that you believe will have the hardest time migrating. Iterate. Shop it with teams who have atypical patterns and edge cases. Iterate. Test it against the next six to twelve months of roadmap. Iterate.

  After you’ve evolved the design, the next step is to embed into the most challenging one or two teams, and work side by side with those teams to build, evolve, and migrate to the new system. Don’t start with the easiest migrations, which can lead to a false sense of security.

  Effective de-risking is essential, because each team who endorses a migration is making a bet on you that you’re going to get this damn thing done, and not leave them stranded partway through a migration to a system that later gets abandoned and has to be reverted. If you leave one migration partially finished, people will be exceedingly suspicious of participating in the next.

  • Enable

  Once you’ve validated the solution that solves the intended problem, it’s time to start sharpening your tools. Many folks start migrations by generating tracking tickets for teams to implement, but it’s better to slow down and build tooling to programmatically migrate the easy 90 percent.24 This radically reduces the migration’s cost to the broader organization, which increases the organization’s success rate and creates more future opportunities to migrate.

  Once you’ve handled as much of the migration programmatically as possible, figure out the self-service tooling and documentation that you can provide to allow teams to make the necessary changes without getting stuck. The best migration tools are incremental and reversible: folks should be able to immediately return to previous behavior if something goes wrong, and they should have the necessary expressiveness to de-risk their particular migration path.

  Documentation and self-service tooling are products, and they thrive under the same regime: sit down with some teams and watch them follow your instructions, then improve them. Find another team. Repeat. Spending an extra two days intentionally making your documentation clean and your tools intuitive can save years in large migrations. Do it!

  • Finish

  The last phase of a migration is deprecating the legacy system that you’ve replaced. This requires getting to 100 percent adoption, and that can be quite challenging.

  Start by stopping the bleeding, which is ensuring that all newly written code uses the new approach. That can mean installing a ratchet in your linters,25 or updating your documentation and self-service tooling. This is always the first step, because it turns time into your friend. Instead of falling behind by default, you’re now making progress by default.
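
  A minimal sketch of such a ratchet, assuming a hypothetical deprecated import you’re migrating away from: the check fails CI if the count ever rises, and tightens its recorded baseline whenever the count falls. (The checker lives outside the source tree it scans, so it doesn’t count itself.)

    import pathlib
    import sys

    DEPRECATED = "from legacy_db import"
    BASELINE_FILE = pathlib.Path("ratchet_baseline.txt")

    def count_offenders(root: str = "src") -> int:
        return sum(
            path.read_text().count(DEPRECATED)
            for path in pathlib.Path(root).rglob("*.py")
        )

    def main() -> int:
        current = count_offenders()
        baseline = (
            int(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else current
        )
        if current > baseline:
            print(f"Ratchet violated: {current} legacy imports (baseline {baseline}).")
            return 1
        if current < baseline:
            BASELINE_FILE.write_text(str(current))
            print(f"Ratchet tightened: {baseline} -> {current}.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())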

  Okay, now you should start generating tracking tickets, and put in place a mechanism that pushes migration status to teams that need to migrate and to the general management structure. It’s important to give wider management context around migrations because the managers are the people who need to prioritize the migrations: if a team isn’t working on a migration, it’s typically because their leadership has not prioritized it.
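
  A sketch of that mechanism, assuming the per-team remaining counts come from the same scan the ratchet uses, and with file_ticket standing in as a placeholder for whatever issue tracker you actually use:

    def file_ticket(team: str, summary: str, description: str) -> None:
        # Placeholder: call your issue tracker's API here.
        print(f"[{team}] {summary}\n  {description}\n")

    def report(remaining_by_team, total_call_sites):
        # remaining_by_team: {team name: remaining legacy call sites}
        migrated = total_call_sites - sum(remaining_by_team.values())
        print(f"Status: {migrated}/{total_call_sites} call sites migrated.")
        for team, remaining in sorted(
            remaining_by_team.items(), key=lambda kv: -kv[1]
        ):
            if remaining == 0:
                continue
            file_ticket(
                team,
                summary=f"Migrate {remaining} remaining legacy call site(s)",
                description="See the migration runbook and self-service tooling.",
            )

    report({"payments": 12, "search": 3, "growth": 0}, total_call_sites=240)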

  At this point, you’re pretty close to complete, but you have the long tail of weird or unstaffed work. Your tool now is: finish it yourself. It’s not necessarily fun, but getting to 100 percent is going to require the team leading the migration to dig into the nooks and crannies themselves.

  My final tip for finishing migrations centers around recognition. It’s important to celebrate migrations while they’re ongoing, but the majority of the celebration and recognition should be reserved for their successful completion. In particular, starting but not finishing migrations often incurs significant technical debt, so your incentives and recognition structure should be careful to avoid perverse incentives.

  3.7 Running an engineering reorg

  I believe that, at quickly growing companies, there are two managerial skills that have a disproportionate impact on your organization’s success: making technical migrations cheap, and running clean reorganizations. Do both well, and you can skip that lovely running-to-stand-still sensation, and invest your attention more fruitfully.

  Of the two, managing organizational change is more general, so let’s work through a lightly structured framework for (re)designing an engineering organization.

  Caveat: this is more of a thinking tool than a recipe!

  My approach for planning organization change:

  Validate that organizational change is the right tool.

  Project head count a year out.

  Set target ratio of management to individual contributors.

  Identify logical teams and groups of teams.

  Plan staffing for the teams and groups.

  Commit to moving forward.

  Roll out the change.

  Now, let’s drill into each of those a bit.

  Figure 3.6

  Refactoring organizations as they grow.

  3.7.1 Is a reorg the right tool?

  There are two best kinds of reorganizations:

  The one that solves a structural problem.

  The one that you don’t do.

  There is only one worst kind of reorg: the one you do because you’re avoiding a people management issue.

  My checklist for ensuring that a reorganization is appropriate:

  Is the problem structural? Organization change offers the opportunity to increase communication, reduce decision friction, and focus attention; if you’re looking for a different change, consider if there’s a more direct approach.

  Are you reorganizing to work around a broken relationship? Management is a profession where karma always comes due, and you’ll be better off addressing the underlying issue than continuing to work around it.

  Does the problem already exist? It’s better to wait until a problem actively exists before solving it, because it’s remarkably hard to predict future problems. Even if you’re right that the problem will occur, you may end up hitting a different problem first.

  Are the conditions temporary? Are you in a major crunch period or otherwise doing something you don’t anticipate doing again? If so, then it may be easier to patch through and rethink on the other side, and avoid optimizing for a transient failure mode.

  All right, so you’re still thinking that you want a reorg.

  3.7.2 Project head count a year out

  The first step of designing the organization is determining its approximate total size. I recommend reasoning through this number from three or four different directions:

  1. An optimistic number based on what’s barely possible.

 
