
An Elegant Puzzle: Systems of Engineering Management


by Will Larson


  From the abstract:

  Complexity is the single major difficulty in the successful development of large-scale software systems. Following Brooks we distinguish accidental from essential difficulty, but disagree with his premise that most complexity remaining in contemporary systems is essential. We identify common causes of complexity and discuss general approaches which can be taken to eliminate them where they are accidental in nature. To make things more concrete we then give an outline for a potential complexity-minimizing approach based on functional programming and Codd’s relational model of data.

  The paper’s certainly a good read, although reading it a decade later, it’s fascinating to see that neither of those approaches has particularly taken off. Instead, the closest “universal” approach to reducing complexity seems to be the move to numerous mostly stateless services. This is perhaps more a reduction of local complexity, at the expense of larger systemic complexity, whose maintenance is then delegated to more specialized systems engineers.

  (This is yet another paper that makes me wish TLA+ felt natural enough to be a commonly adopted tool.)

  “The Chubby Lock Service for Loosely-Coupled Distributed Systems”

  Distributed systems are hard enough without having to frequently reimplement Paxos or Raft. The model Chubby proposes is to implement consensus once, in a shared service, letting the systems built on top of it share in the resilience of distribution while following greatly simplified patterns.

  From the abstract:

  We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design emphasis is on availability and reliability, as opposed to high performance. Many instances of the service have been used for over a year, with several of them each handling a few tens of thousands of clients concurrently. The paper describes the initial design and expected use, compares it with actual use, and explains how the design had to be modified to accommodate the differences.

  In the open source world, ZooKeeper fills the role for projects like Kafka and Mesos that Chubby fills at Google.
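
  To make the pattern concrete, here is a minimal sketch of coarse-grained locking in that style, assuming a ZooKeeper ensemble reachable on localhost and the Python kazoo client library (the path and identifier are made up):

```python
# Coarse-grained locking in the Chubby/ZooKeeper style: consensus lives in
# the shared service, and clients get resilience via a simple lock recipe.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed ensemble address
zk.start()

# Contenders on the same path elect exactly one holder at a time.
lock = zk.Lock("/services/indexer/leader", "worker-1")
with lock:  # blocks until this client holds the lock
    print("acting as leader; safe to do the coarse-grained work")

zk.stop()
```

  This is exactly the “implement consensus once” bargain: clients get distributed resilience through an interface no harder to use than a mutex.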

  “Bigtable: A Distributed Storage System for Structured Data”

  One of Google’s preeminent papers and technologies is Bigtable, which was an early (early in the internet era, anyway) NoSQL data store, operating at extremely high scale and built on top of Chubby.

  Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

  From the SSTable design to the bloom filters, Cassandra inherits significantly from the Bigtable paper, and is probably rightfully considered a merging of the Dynamo and Bigtable papers.
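  The bloom filter’s job in this design is to answer “definitely not in this SSTable” from memory, so most reads never pay for a disk seek. A minimal sketch of that read path (the sizes and hash counts are illustrative, not tuned):

```python
import hashlib

class BloomFilter:
    """Probabilistic set: never a false negative, occasionally a false positive."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a plain int doubles as a bit array

    def _positions(self, key):
        # Derive k independent bit positions from a keyed hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# Read path: consult the in-memory filter before touching the disk.
sstable_filter = BloomFilter()
sstable_filter.add("row:42")
if sstable_filter.might_contain("row:7"):
    pass  # only now would we seek into the on-disk SSTable
```
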

  “Spanner: Google’s Globally-Distributed Database”

  Many early NoSQL storage systems traded eventual consistency for increased resiliency, but building on top of eventually consistent systems can be harrowing. Spanner represents an approach from Google to offering both strong consistency and distributed reliability, based in part on a novel approach to managing time.

  Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

  We haven’t seen any open source Spanner equivalents yet, but I imagine we’ll start seeing them soon.
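
  The novel time API the abstract mentions, TrueTime, reports a bounded interval rather than a single timestamp. Here is a hypothetical, much-simplified sketch of the resulting “commit wait” idea; the interval width and helper names are assumptions, not Spanner’s actual interface:

```python
import time
from dataclasses import dataclass

# Hypothetical stand-in for Spanner's TrueTime API: rather than one number,
# "now" is an interval [earliest, latest] guaranteed to contain true time.
@dataclass
class TTInterval:
    earliest: float
    latest: float

EPSILON = 0.007  # assumed worst-case clock uncertainty (~7 ms)

def tt_now() -> TTInterval:
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit(apply_writes) -> float:
    # Choose a timestamp that no clock in the system could yet consider past.
    commit_ts = tt_now().latest
    apply_writes(commit_ts)
    # Commit wait: don't acknowledge until commit_ts is unambiguously in the
    # past on every clock, so timestamp order matches real-time order.
    while tt_now().earliest <= commit_ts:
        time.sleep(EPSILON / 2)
    return commit_ts
```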

  “Security Keys: Practical Cryptographic Second Factors for the Modern Web”

  Security keys like the YubiKey18 have emerged as the most secure second authentication factor, and this paper out of Google explains the motivations that led to their creation, as well as the design that makes them work.

  From the abstract:

  Security Keys are second-factor devices that protect users against phishing and man-in-the-middle attacks. Users carry a single device and can self-register it with any online service that supports the protocol. The devices are simple to implement and deploy, simple to use, privacy preserving, and secure against strong attackers. We have shipped support for Security Keys in the Chrome web browser and in Google’s online services. We show that Security Keys lead to both an increased level of security and user satisfaction by analyzing a two-year deployment which began within Google and has extended to our consumer-facing web applications. The Security Key design has been standardized by the FIDO Alliance, an organization with more than 250 member companies spanning the industry. Currently, Security Keys have been deployed by Google, Dropbox, and GitHub.

  These keys are also remarkably cheap! Order a few and start securing your life in a day or two.
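
  The phishing resistance comes from binding signatures to the origin the key was registered with. A hedged sketch of that core check, using the Python cryptography package (real Security Keys use ECDSA over P-256 plus the U2F message framing; Ed25519 here is just for brevity):

```python
# The device signs the server's challenge together with the origin the key
# pair was registered for, so a look-alike domain can't replay anything.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

device_key = ed25519.Ed25519PrivateKey.generate()  # lives on the device
public_key = device_key.public_key()               # stored by the service

def device_sign(challenge: bytes, origin: str) -> bytes:
    return device_key.sign(challenge + origin.encode())

def server_verify(signature: bytes, challenge: bytes, origin: str) -> bool:
    try:
        public_key.verify(signature, challenge + origin.encode())
        return True
    except InvalidSignature:
        return False

sig = device_sign(b"nonce-123", "https://example.com")
print(server_verify(sig, b"nonce-123", "https://example.com"))  # True
print(server_verify(sig, b"nonce-123", "https://examp1e.com"))  # False: phish
```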

  “BeyondCorp: Design to Deployment at Google”

  Building on the original BeyondCorp paper,19 which was published in 2014, this paper is slightly more detailed and benefits from two more years of migration-fueled wisdom. That said, the big ideas have remained fairly consistent, and there is not much new relative to the BeyondCorp paper itself. If you haven’t read that fantastic paper, this is an equally good starting point:

  The goal of Google’s BeyondCorp initiative is to improve our security with regard to how employees and devices access internal applications. Unlike the conventional perimeter security model, BeyondCorp doesn’t gate access to services and tools based on a user’s physical location or the originating network; instead, access policies are based on information about a device, its state, and its associated user. BeyondCorp considers both internal networks and external networks to be completely untrusted, and gates access to applications by dynamically asserting and enforcing levels, or tiers, of access.

  As is often the case when I read Google papers, my biggest takeaway here is to wonder when we’ll start to see reusable, pluggable open source versions of the techniques described within.
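
  Until those open source versions arrive, the core decision is easy enough to sketch. Assuming made-up field names, the essential move is that the source network never appears in the policy, only device state and user identity:

```python
# A minimal sketch of a BeyondCorp-style access decision: trust derives from
# device state and user identity, never from the arriving network.
from dataclasses import dataclass

@dataclass
class Device:
    managed: bool
    disk_encrypted: bool
    patched: bool

@dataclass
class Request:
    user_groups: set
    device: Device
    source_network: str  # recorded, but deliberately ignored below

def access_tier(device: Device) -> str:
    if device.managed and device.disk_encrypted and device.patched:
        return "full"
    if device.managed:
        return "limited"
    return "untrusted"

TIERS = ["untrusted", "limited", "full"]

def authorize(req: Request, required_tier: str, required_group: str) -> bool:
    return (required_group in req.user_groups
            and TIERS.index(access_tier(req.device)) >= TIERS.index(required_tier))

laptop = Device(managed=True, disk_encrypted=True, patched=False)
req = Request({"eng"}, laptop, source_network="corp-hq")  # location is irrelevant
print(authorize(req, "full", "eng"))  # False: the device isn't patched
```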

  “Availability in Globally Distributed Storage Systems”

  This paper explores how to think about availability in replicated distributed systems, and is a useful starting point for those of us who are trying to determine the correct way to measure uptime for our storage layer or for any other sufficiently complex system.

  From the abstract:

  We characterize the availability properties of cloud storage systems based on an extensive one-year study of Google’s main storage infrastructure and present statistical models that enable further insight into the impact of multiple design choices, such as data placement and replication strategies. With these models we compare data availability under a variety of system parameters given the real patterns of failures observed in our fleet.

  Particularly interesting is the focus on correlated failures, building on the premise that users of distributed systems only experience a failure when multiple components have overlapping failures. Another expected but reassuring observation is that at Google’s scale (and with resources distributed across racks and regions), most failure comes from tuning and system design, not from the underlying hardware.

  I was also surprised by how simple their definition of availability was in this case:

  A storage node becomes unavailable when it fails to respond positively to periodic health checking pings sent by our monitoring system. The node remains unavailable until it regains responsiveness or the storage system reconstructs the data from other surviving nodes.

  Often, discussions of availability become arbitrarily complex (“It should really be that response rates are over X, but with correct results and within our latency SLO!”), and it’s reassuring to see the simplest definitions are still usable.
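
  That definition also makes the arithmetic trivial. A sketch, assuming we log one boolean health-check result per node per probe interval:

```python
# Availability as the paper defines it: the fraction of node-intervals in
# which a node responded positively to the monitoring system's pings.
def availability(ping_log: dict[str, list[bool]]) -> float:
    """ping_log maps node id -> per-interval health-check results."""
    total = sum(len(results) for results in ping_log.values())
    up = sum(sum(results) for results in ping_log.values())
    return up / total

log = {"node-a": [True, True, False, True],
       "node-b": [True, True, True, True]}
print(f"{availability(log):.2%}")  # 87.50%
```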

  “Still All on One Server: Perforce at Scale”

  As a company grows, code hosting performance becomes one of the critical factors in overall developer productivity (along with build and test performance), but it’s a topic that isn’t discussed frequently. This paper from Google discusses their experience scaling Perforce:

  Google runs the busiest single Perforce server on the planet, and one of the largest repositories in any source control system. From this high-water mark this paper looks at server performance and other issues of scale, with digressions into where we are, how we got here, and how we continue to stay one step ahead of our users.

  This paper is particularly impressive when you consider the difficulties that companies run into as they scale Git monorepos (talk to an ex-Twitter employee near you for war stories).

  “Large-Scale Automated Refactoring Using ClangMR”

  Large codebases tend to age poorly, especially in the case of monorepos, where hundreds or thousands of different teams collaborate on different projects.

  This paper covers one of Google’s attempts to reduce the burden of maintaining their large monorepo through tooling that makes it easy to rewrite abstract syntax trees (ASTs) across the entire codebase.

  From the abstract:

  In this paper, we present a real-world implementation of a system to refactor large C++ codebases efficiently. A combination of the Clang compiler framework and the MapReduce parallel processor, ClangMR enables code maintainers to easily and correctly transform large collections of code. We describe the motivation behind such a tool, its implementation and then present our experiences using it in a recent API update with Google’s C++ codebase.

  Similar work is being done with Pivot.20
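
  ClangMR itself operates on C++ through Clang’s AST, but the shape of the idea fits in a few lines using Python’s own ast module: parse, transform matching nodes mechanically, and emit the rewritten source. The old_api/new_api names are of course hypothetical:

```python
# ClangMR applies Clang-based rewrites across a codebase via MapReduce; here
# is the same transformation idea in miniature, using Python's ast module.
import ast

class RenameCall(ast.NodeTransformer):
    """Mechanically rewrite every call to old_api() into new_api()."""

    def visit_Call(self, node):
        self.generic_visit(node)  # recurse first so nested calls are handled
        if isinstance(node.func, ast.Name) and node.func.id == "old_api":
            node.func.id = "new_api"
        return node

source = "result = old_api(1, old_api(2))"
tree = RenameCall().visit(ast.parse(source))
print(ast.unparse(tree))  # result = new_api(1, new_api(2))
```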

  “Source Code Rejuvenation is not Refactoring”

  This paper introduces the concept of “code rejuvenation,” a unidirectional process of moving toward cleaner abstractions as new language features and libraries become available, which is particularly applicable to sprawling, older codebases.

  From the abstract:

  In this paper, we present the notion of source code rejuvenation, the automated migration of legacy code and very briefly mention the tools we use to achieve that. While refactoring improves structurally inadequate source code, source code rejuvenation leverages enhanced program language and library facilities by finding and replacing coding patterns that can be expressed through higher-level software abstractions. Raising the level of abstraction benefits software maintainability, security, and performance.

  There are some strong echoes of this work in Google’s ClangMR paper.21

  “Searching for Build Debt: Experiences Managing Technical Debt at Google”

  This paper is an interesting look at how to perform large-scale migrations in living codebases. Using broken builds as the running example, the authors break their strategy down into three pillars: automation, making it easy to do the right thing, and making it hard to do the wrong thing.

  From the abstract:

  With a large and rapidly changing codebase, Google software engineers are constantly paying interest on various forms of technical debt. Google engineers also make efforts to pay down that debt, whether through special Fixit days, or via dedicated teams, variously known as janitors, cultivators, or demolition experts. We describe several related efforts to measure and pay down technical debt found in Google’s BUILD files and associated dead code. We address debt found in dependency specifications, unbuildable targets, and unnecessary command line flags. These efforts often expose other forms of technical debt that must first be managed.

  “No Silver Bullet—Essence and Accident in Software Engineering”

  A seminal paper from the author of The Mythical Man-Month, “No Silver Bullet” expands on discussions of accidental versus essential complexity, and argues that the accidental complexity remaining in software is no longer large enough for any single reduction of it to yield an order-of-magnitude increase in engineer productivity.

  From the abstract:

  Most of the big past gains in software productivity have come from removing artificial barriers that have made the accidental tasks inordinately hard, such as severe hardware constraints, awkward programming languages, lack of machine time. How much of what software engineers now do is still devoted to the accidental, as opposed to the essential? Unless it is more than 9/10 of all effort, shrinking all the accidental activities to zero time will not give an order of magnitude improvement.
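
  Spelling out Brooks’s arithmetic: if a fraction a of total effort T is accidental, eliminating all of it leaves T(1 − a), so

```latex
\text{speedup} = \frac{T}{T(1-a)} = \frac{1}{1-a},
\qquad
\frac{1}{1-a} \ge 10 \iff a \ge \frac{9}{10}
```

  Hence the 9/10 threshold: below it, even perfect elimination of accidental work cannot produce an order-of-magnitude improvement.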

  I think that, interestingly, we do see accidental complexity in large codebases grow large enough that removing it yields order-of-magnitude improvements (motivating, for example, Google’s investments in ClangMR and such), so perhaps we’re not quite as far along in the shift to essential complexity as we’d like to believe.

  “The UNIX Time-Sharing System”

  This paper describes the fundamentals of UNIX as of 1974. What is truly remarkable is how many of its design decisions remain with us today. From the permission model we’ve all adjusted with chmod to the system calls used to manipulate files, it’s amazing how much survives intact.
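
  As a small illustration of that continuity, the owner/group/other permission triads described in the paper are still exactly what os.chmod speaks today:

```python
# The 1974 rwx permission model, unchanged: three bits each for owner,
# group, and other, conventionally written in octal.
import os
import stat

path = "notes.txt"
open(path, "w").close()  # create a file to demonstrate on

# 0o640: owner read/write, group read, others nothing.
os.chmod(path, stat.S_IRUSR | stat.S_IWUSR | stat.S_IRGRP)

print(oct(stat.S_IMODE(os.stat(path).st_mode)))  # 0o640
```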

  From the abstract:

  UNIX is a general-purpose, multi-user, interactive operating system for the Digital Equipment Corporation PDP-11/40 and 11/45 computers. It offers a number of features seldom found even in larger operating systems, including: (1) a hierarchical file system incorporating demountable volumes; (2) compatible file, device, and interprocess I/O; (3) the ability to initiate asynchronous processes; (4) system command language selectable on a per-user basis; and (5) over 100 subsystems including a dozen languages. This paper discusses the nature and implementation of the file system and of the user command interface.

  Also fascinating is the observation that UNIX succeeded in part because its authors designed it to solve a general problem of their own (working with the PDP-7 was frustrating), rather than to progress toward a more narrowly specified goal.
