
An Elegant Puzzle: Systems of Engineering Management


by Will Larson


  “On Designing and Deploying Internet-Scale Services”

  We don’t always remember to consider Microsoft as one of the largest internet technology players—although, increasingly, Azure is making that comparison obvious and immediate—and it certainly wasn’t a name that necessarily came to mind in 2007. This excellent paper from James Hamilton, which explores tips on building operable systems at extremely large scale, makes it clear that not considering Microsoft as a large internet player was a lapse in our collective judgment.

  From the abstract:

  The system-to-administrator ratio is commonly used as a rough metric to understand administrative costs in high-scale services. With smaller, less automated services this ratio can be as low as 2:1, whereas on industry leading, highly automated services, we’ve seen ratios as high as 2,500:1. Within Microsoft services, Autopilot [1] is often cited as the magic behind the success of the Windows Live Search team in achieving high system-to-administrator ratios. While auto-administration is important, the most important factor is actually the service itself. Is the service efficient to automate? Is it what we refer to more generally as operations-friendly? Services that are operations-friendly require little human intervention, and both detect and recover from all but the most obscure failures without administrative intervention. This paper summarizes the best practices accumulated over many years in scaling some of the largest services at MSN and Windows Live.

  This is a true checklist of how to design and evaluate large-scale systems (almost in the way that The Twelve-Factor App4 wants to serve as a checklist for operable applications).

  “CAP Twelve Years Later: How the ‘Rules’ Have Changed”

  Eric Brewer posited the CAP theorem5 in the early aughts, and 12 years later he wrote this excellent overview and review of CAP (which argues that distributed systems have to choose between availability and consistency during partitions). Here’s the rationale for the paper, in Brewer’s words:

  In the decade since its introduction, designers and researchers have used (and sometimes abused) the CAP theorem as a reason to explore a wide variety of novel distributed systems. The NoSQL movement also has applied it as an argument against traditional databases.

  CAP is interesting because there is not a “seminal CAP paper,” but this article serves well in such a paper’s stead. These ideas are expanded on in the “Harvest and Yield” paper.6
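
  To make the tradeoff concrete: when a replica is cut off from its peers, it can either answer from its possibly stale local copy or refuse the request until the partition heals. A minimal sketch of that choice, with entirely hypothetical names:

```
# Toy sketch of the AP-versus-CP choice during a network partition.
# All names here are illustrative, not from any real system.

class Replica:
    def __init__(self, local_data, prefer_availability):
        self.local_data = local_data          # possibly stale local copy
        self.prefer_availability = prefer_availability

    def read(self, key, quorum_reachable):
        if quorum_reachable:
            return self.local_data[key]       # normal case: no partition
        if self.prefer_availability:          # "AP": answer, possibly stale
            return self.local_data.get(key)
        # "CP": refuse rather than risk returning stale data
        raise RuntimeError("partitioned: refusing a possibly stale read")

replica = Replica({"user:1": "alice"}, prefer_availability=True)
print(replica.read("user:1", quorum_reachable=False))  # returns a possibly stale value
```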

  “Harvest, Yield, and Scalable Tolerant Systems”

  This paper builds on the ideas in “CAP Twelve Years Later,” introducing the concepts of harvest and yield to add more nuance to the discussion about AP versus CP.

  The cost of reconciling consistency and state management with high availability is highly magnified by the unprecedented scale and robustness requirements of today’s internet applications. We propose two strategies for improving overall availability using simple mechanisms that scale over large applications whose output behavior tolerates graceful degradation. We characterize this degradation in terms of harvest and yield, and map it directly onto engineering mechanisms that enhance availability by improving fault isolation, and in some cases also simplify programming. By collecting examples of related techniques in the literature and illustrating the surprising range of applications that can benefit from these approaches, we hope to motivate a broader research program in this area.

  The harvest and yield concepts are particularly interesting because they are both self-evident and very rarely explicitly used; instead, distributed systems continue to fail in mostly undefined ways. Hopefully, as we keep rereading this paper, we’ll also start to incorporate its design concepts into the systems we subsequently build!
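
  To keep the two terms straight: yield is the fraction of requests that get answered at all, and harvest is the fraction of the complete data that an answer reflects. A toy calculation, with made-up numbers:

```
# Toy calculation of harvest and yield as defined in the paper:
#   yield   = requests completed / requests offered
#   harvest = data reflected in the response / complete data
# The numbers below are invented for illustration.

requests_offered = 1_000
requests_completed = 990              # 10 requests failed outright
shards_total = 20
shards_reachable = 18                 # 2 shards were down, so answers were partial

yield_fraction = requests_completed / requests_offered
harvest_fraction = shards_reachable / shards_total

print(f"yield   = {yield_fraction:.1%}")   # 99.0%: most requests got an answer
print(f"harvest = {harvest_fraction:.1%}") # 90.0%: answers reflected 90% of the data
```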

  “MapReduce: Simplified Data Processing on Large Clusters”

  The MapReduce paper is an excellent example of an idea that has been so successful that it now seems self-evident. The idea of applying the concepts of functional programming at scale became a clarion call, provoking a shift from data warehousing to a new paradigm for data analysis:

  MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

  Much like the Google File System paper was an inspiration for the Hadoop Distributed File System, this paper was itself a major inspiration for Hadoop.
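
  The programming model itself fits in a few lines. Here is a single-process sketch of the canonical word-count example; the real system’s contribution is distributing the map, shuffle, and reduce phases across a cluster, which this deliberately omits:

```
# A minimal, single-process sketch of the MapReduce programming model,
# using the canonical word-count example.
from collections import defaultdict

def map_fn(document):
    # Emit an intermediate (key, value) pair per word.
    for word in document.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all intermediate values that share the same key.
    return word, sum(counts)

def map_reduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                      # map phase
        for key, value in map_fn(doc):
            intermediate[key].append(value)    # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

print(map_reduce(["the quick brown fox", "the lazy dog"]))
```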

  “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”

  The Dapper paper introduces a performant approach to tracing requests across many services, which has become increasingly relevant as more companies refactor core monolithic applications into dozens or hundreds of micro-services.

  From the abstract:

  Here we introduce the design of Dapper, Google’s production distributed systems tracing infrastructure, and describe how our design goals of low overhead, application-level transparency, and ubiquitous deployment on a very large-scale system were met. Dapper shares conceptual similarities with other tracing systems, particularly Magpie and X-Trace, but certain design choices were made that have been key to its success in our environment, such as the use of sampling and restricting the instrumentation to a rather small number of common libraries.

  The ideas from Dapper have since made their way into open source, especially in Zipkin7 and OpenTracing.8
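
  The core mechanism is simple to sketch: every request carries a trace ID shared by all of the services it touches, plus the span ID of its caller, so spans can later be assembled into a tree. The sketch below is illustrative only; the names are not the Dapper or OpenTracing API:

```
# Minimal sketch of Dapper-style trace propagation. The trace id rides along
# with each downstream call; the collector can later stitch spans into a tree.
import time
import uuid

TRACES = []  # stand-in for the collector backend

def start_span(name, trace_id=None, parent_id=None):
    return {
        "name": name,
        "trace_id": trace_id or uuid.uuid4().hex,  # new trace at the edge
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "start": time.time(),
    }

def finish_span(span):
    span["duration"] = time.time() - span["start"]
    TRACES.append(span)

def frontend(request):
    span = start_span("frontend")
    backend(request, span["trace_id"], span["span_id"])   # ids ride along with the RPC
    finish_span(span)

def backend(request, trace_id, parent_id):
    span = start_span("backend", trace_id, parent_id)
    finish_span(span)

frontend({"path": "/home"})
print([(s["name"], s["trace_id"], s["parent_id"]) for s in TRACES])
```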

  “Kafka: a Distributed Messaging System for Log Processing”

  Apache Kafka9 has become a core piece of infrastructure for many internet companies. Its versatility lends it to many roles, serving as the ingress point to “data land” for some and as a durable queue for others. And that’s just scratching the surface.

  Kafka is not only a useful addition to your tool kit; it’s also a beautifully designed system:

  Log processing has become a critical component of the data pipeline for consumer internet companies. We introduce Kafka, a distributed messaging system that we developed for collecting and delivering high volumes of log data with low latency. Our system incorporates ideas from existing log aggregators and messaging systems, and is suitable for both offline and online message consumption. We made quite a few unconventional yet practical design choices in Kafka to make our system efficient and scalable. Our experimental results show that Kafka has superior performance when compared to two popular messaging systems. We have been using Kafka in production for some time and it is processing hundreds of gigabytes of new data each day.

  In particular, Kafka’s partitions do a phenomenal job of forcing application designers to make explicit decisions about trading off performance for predictable message ordering.
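
  A rough sketch of the decision Kafka hands to the application (not the real client API): messages that share a key land on a single partition and stay ordered relative to one another, while keyless messages can be spread across partitions for throughput at the cost of any ordering guarantee:

```
# Sketch of the partitioning decision pushed onto the application.
import zlib

NUM_PARTITIONS = 8
next_round_robin = 0

def choose_partition(key=None):
    global next_round_robin
    if key is not None:
        # A stable hash keeps every message for this key on one partition,
        # so per-key ordering is preserved.
        return zlib.crc32(key.encode()) % NUM_PARTITIONS
    # No key: rotate across partitions for throughput, with no ordering.
    next_round_robin = (next_round_robin + 1) % NUM_PARTITIONS
    return next_round_robin

print(choose_partition("user-42"), choose_partition("user-42"))  # same partition both times
print(choose_partition(), choose_partition())                    # spread across partitions
```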

  “Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services”

  In many ways similar to Kafka, Facebook’s Wormhole is another highly scalable approach to messaging:

  Wormhole is a publish-subscribe (pub-sub) system developed for use within Facebook’s geographically replicated datacenters. It is used to reliably replicate changes among several Facebook services including TAO, Graph Search, and Memcache. This paper describes the design and implementation of Wormhole as well as the operational challenges of scaling the system to support the multiple data storage systems deployed at Facebook. Our production deployment of Wormhole transfers over 35 GBytes/sec in steady state (50 million messages/sec or 5 trillion messages/day) across all deployments with bursts up to 200 GBytes/sec during failure recovery. We demonstrate that Wormhole publishes updates with low latency to subscribers that can fail or consume updates at varying rates, without compromising efficiency.

  In particular, note the approach to supporting lagging consumers without sacrificing overall system throughput.
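
  One common way to get that property, sketched below with made-up names rather than Wormhole’s actual design, is a shared append-only log with a per-subscriber cursor: publishing never waits on a subscriber, and a slow subscriber only delays itself:

```
# Toy sketch of supporting consumers at different rates over a shared log.

log = []                              # shared, append-only sequence of updates
cursors = {"fast": 0, "slow": 0}      # each subscriber remembers its own position

def publish(update):
    log.append(update)                # publishing never waits on any subscriber

def poll(subscriber, max_batch=2):
    start = cursors[subscriber]
    batch = log[start:start + max_batch]
    cursors[subscriber] = start + len(batch)   # only this subscriber advances
    return batch

for i in range(5):
    publish(f"update-{i}")

print(poll("fast", max_batch=5))  # caught up: all five updates
print(poll("slow"))               # lagging: ['update-0', 'update-1']
print(poll("slow"))               # resumes where it left off
```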

  “Borg, Omega, and Kubernetes”

  While the individual papers for each of Google’s orchestration systems (Borg, Omega, and Kubernetes) are worth reading in their own right, this article is an excellent overview of the three:

  Though widespread interest in software containers is a relatively recent phenomenon, at Google we have been managing Linux containers at scale for more than ten years and built three different container-management systems in that time. Each system was heavily influenced by its predecessors, even though they were developed for different reasons. This article describes the lessons we’ve learned from developing and operating them.

  Fortunately, not all orchestration happens under Google’s aegis, and the paper on Apache Mesos’s alternative two-level scheduling architecture is a fascinating read as well.

  “Large-Scale Cluster Management at Google with Borg”

  Borg has orchestrated much of Google’s infrastructure for quite some time (significantly predating Omega, although, fascinatingly, the Omega paper predates the Borg paper by two years):

  Google’s Borg system is a cluster manager that runs hundreds of thousands of jobs, from many thousands of different applications, across a number of clusters each with up to tens of thousands of machines.

  This paper takes a look at Borg’s centralized scheduling model, which was both effective and efficient, although it became increasingly challenging to modify and scale over time. Borg inspired both Omega and Kubernetes within Google (the former to optimistically replace it, and the latter to seemingly commercialize the designers’ learnings, or at least to prevent Mesos from capturing too much mind share).

  “Omega: Flexible, Scalable Schedulers for Large Compute Clusters”

  Omega is, among many other things, an excellent example of the second-system effect,10 in which an attempt to replace a complex existing system with something far more elegant ends up being more challenging than anticipated.

  In particular, Omega is a reaction against the realities of extending the aging Borg system:

  Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control.

  Perhaps it’s also an example of “worse is better”11 once again taking the day.
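
  The lock-free, shared-state approach mentioned in the abstract is worth a small illustration. In the sketch below (purely illustrative, not Omega’s actual data model), each scheduler plans against its own snapshot of cluster state and then tries to commit atomically, retrying if another scheduler got there first:

```
# Toy sketch of optimistic, shared-state scheduling with retry on conflict.

cell_state = {"version": 0, "free": {"m1", "m2", "m3"}}

def try_commit(snapshot, machine):
    # Commit succeeds only if no one else changed the state since the snapshot.
    if cell_state["version"] != snapshot["version"] or machine not in cell_state["free"]:
        return False
    cell_state["free"].remove(machine)
    cell_state["version"] += 1
    return True

def schedule(job):
    while True:
        snapshot = {"version": cell_state["version"], "free": set(cell_state["free"])}
        if not snapshot["free"]:
            raise RuntimeError(f"no capacity for {job}")
        machine = sorted(snapshot["free"])[0]      # scheduler-specific policy goes here
        if try_commit(snapshot, machine):          # optimistic commit; retry on conflict
            return machine

print(schedule("batch-job"), schedule("web-server"))
```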

  “Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center”

  This paper describes the design of Apache Mesos,12 in particular its distinctive two-level scheduler:

  We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Sharing improves cluster utilization and avoids per-framework data replication. Mesos shares resources in a fine-grained manner, allowing frameworks to achieve data locality by taking turns reading data stored on each machine. To support the sophisticated schedulers of today’s frameworks, Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them.

  Our results show that Mesos can achieve near-optimal data locality when sharing the cluster among diverse frameworks, can scale to 50,000 (emulated) nodes, and is resilient to failures.

  Used heavily by Twitter and Apple, Mesos was for some time the only open-source general scheduler with significant adoption. It’s now in a fascinating competition for mind share with Kubernetes.
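
  A toy version of the resource-offer handshake, with made-up names and sizes: the first level decides which framework is offered which resources, and the second level (the framework’s own scheduler) decides which offers to accept and what to launch on them:

```
# Toy sketch of two-level scheduling via resource offers.

offers = [                         # level one: the master carves out offers
    {"host": "agent-1", "cpus": 4, "mem_gb": 16},
    {"host": "agent-2", "cpus": 1, "mem_gb": 2},
]

def framework_scheduler(offer, task):
    # Level two: the framework applies its own policy (data locality,
    # task requirements, and so on) and accepts or declines each offer.
    if offer["cpus"] >= task["cpus"] and offer["mem_gb"] >= task["mem_gb"]:
        return {"launch": task["name"], "on": offer["host"]}
    return None   # decline; the master will re-offer these resources later

task = {"name": "hadoop-map-0", "cpus": 2, "mem_gb": 4}
decisions = [framework_scheduler(o, task) for o in offers]
print(decisions)   # accepts agent-1's offer, declines agent-2's
```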

  “Design Patterns for Container-Based Distributed Systems”

  The move to container-based deployment and orchestration has introduced a whole new set of vocabulary, including “sidecars” and “adapters.” This paper provides a survey of the patterns that have evolved over the past decade, as microservices and containers have become increasingly prominent infrastructure components:

  In the late 1980s and early 1990s, object-oriented programming revolutionized software development, popularizing the approach of building applications as collections of modular components. Today we are seeing a similar revolution in distributed system development, with the increasing popularity of microservice architectures built from containerized software components. Containers are particularly well-suited as the fundamental “object” in distributed systems by virtue of the walls they erect at the container boundary. As this architectural style matures, we are seeing the emergence of design patterns, much as we did for object-oriented programs, and for the same reason—thinking in terms of objects (or containers) abstracts away the low-level details of code, eventually revealing higher-level patterns that are common to a variety of applications and algorithms.

  The term “sidecar,” in particular, likely originated in this blog post from Netflix,13 which is a worthy read in its own right.
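
  The most common concrete form of the sidecar pattern today is a Kubernetes pod that pairs the application container with a helper container sharing a volume. The manifest below is written as a Python dict for consistency with the other sketches, and the image names are placeholders:

```
# A log-shipping sidecar: the app writes logs to a shared volume and the
# sidecar tails the same volume and ships them elsewhere.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-with-log-shipper"},
    "spec": {
        "containers": [
            {   # the application container writes its logs to the shared volume
                "name": "web",
                "image": "example.com/web-app:1.0",
                "volumeMounts": [{"name": "logs", "mountPath": "/var/log/app"}],
            },
            {   # the sidecar reads the same volume and forwards the logs
                "name": "log-shipper",
                "image": "example.com/log-shipper:1.0",
                "volumeMounts": [{"name": "logs", "mountPath": "/var/log/app"}],
            },
        ],
        "volumes": [{"name": "logs", "emptyDir": {}}],
    },
}

print([c["name"] for c in pod["spec"]["containers"]])
```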

  “Raft: In Search of an Understandable Consensus Algorithm”

  We often see the second-system effect when a second system becomes bloated and complex relative to a simple initial system. However, the roles are reversed in the case of Paxos and Raft. Whereas Paxos is often considered beyond human comprehension, Raft is a fairly easy read:

  Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Results from a user study demonstrate that Raft is easier for students to learn than Paxos. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety.

  Raft is used by etcd14 and InfluxDB,15 among many others.
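
  To give a feel for the “understandable” part, here is a heavily compressed sketch of the election trigger: a follower that hears no heartbeat within its randomized election timeout increments its term, becomes a candidate, and asks its peers for votes, winning leadership on a majority. Log comparisons and most term bookkeeping are omitted:

```
# Heavily simplified sketch of Raft's leader-election trigger.
import random

class Node:
    def __init__(self, name, peers):
        self.name, self.peers = name, peers
        self.state, self.term, self.voted_for = "follower", 0, None
        self.election_timeout = random.uniform(150, 300)  # ms; randomized to avoid split votes

    def on_election_timeout(self):
        # No heartbeat from a leader: start an election for the next term.
        self.state, self.term, self.voted_for = "candidate", self.term + 1, self.name
        votes = 1 + sum(peer.request_vote(self.term, self.name) for peer in self.peers)
        if votes > (len(self.peers) + 1) // 2:             # majority, counting ourselves
            self.state = "leader"

    def request_vote(self, term, candidate):
        if term > self.term:                               # newer term: grant the vote
            self.term, self.voted_for, self.state = term, candidate, "follower"
            return True
        return False

a, b, c = Node("a", []), Node("b", []), Node("c", [])
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.on_election_timeout()
print(a.state, a.term)   # leader 1
```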

  “Paxos Made Simple”

  One of Leslie Lamport’s numerous influential papers, “Paxos Made Simple” is a gem, both in explaining the notoriously complex Paxos algorithm and because, even at its simplest, Paxos isn’t really that simple:

  The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm—the “synod” algorithm. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system—an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems.

  Paxos itself remains a deeply innovative concept, and is the algorithm behind Google’s Chubby and Apache Zookeeper,16 among many others.
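
  The synod algorithm’s two phases can at least be sketched, even if the subtleties can’t. Below is a toy, in-memory, single-proposer version: phase one sends prepare(n) and collects promises, and phase two sends accept(n, v), where v must be the value of the highest-numbered proposal any acceptor has already accepted (or the proposer’s own value if none has):

```
# Toy sketch of single-decree Paxos from the proposer's point of view.

class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.accepted = None            # (n, value) once something is accepted

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted  # promise, plus any previously accepted value
        return False, None

    def accept(self, n, value):
        if n >= self.promised_n:
            self.promised_n, self.accepted = n, (n, value)
            return True
        return False

def propose(acceptors, n, value):
    responses = [a.prepare(n) for a in acceptors]
    promises = [acc for ok, acc in responses if ok]
    if len(promises) <= len(acceptors) // 2:
        return None                                   # no majority of promises
    already = [acc for acc in promises if acc is not None]
    if already:                                       # must adopt the highest accepted value
        value = max(already, key=lambda acc: acc[0])[1]
    accepted = sum(a.accept(n, value) for a in acceptors)
    return value if accepted > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="x"))   # 'x' is chosen
print(propose(acceptors, n=2, value="y"))   # a later proposal still learns 'x'
```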

  “SWIM: Scalable Weakly-Consistent Infection-Style Process Group Membership Protocol”

  The majority of consensus algorithms focus on being consistent during partition, but SWIM goes the other direction and focuses on availability:

  Several distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes. SWIM is a generic software module that offers this service for large-scale process groups. The SWIM effort is motivated by the unscalability of traditional heart-beating protocols, which either impose network loads that grow quadratically with group size, or compromise response times or false positive frequency w.r.t. detecting process crashes. This paper reports on the design, implementation, and performance of the SWIM sub-system on a large cluster of commodity PCs.

  SWIM is used in HashiCorp’s software, as well as Uber’s Ringpop.
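
  A toy round of SWIM-style failure detection, with the network simulated by a set of live members: each period, a member pings one random peer directly, falls back to asking k other members to ping it indirectly, and only then declares it failed (an update that would then be gossiped):

```
# Toy simulation of one SWIM failure-detection round.
import random

ALIVE = {"a", "b", "c", "d", "e"}        # ground truth for the simulation

def ping(source, target):
    return target in ALIVE               # stands in for a real network round trip

def swim_round(me, members, k=2):
    target = random.choice([m for m in members if m != me])
    if ping(me, target):
        return target, "alive"
    # Direct ping failed: try k indirect probes through other members.
    helpers = random.sample([m for m in members if m not in (me, target)], k)
    if any(ping(helper, target) for helper in helpers):
        return target, "alive"           # reachable via another path
    return target, "failed"              # this update would then be gossiped

members = ["a", "b", "c", "d", "e", "f"]  # "f" has crashed (not in ALIVE)
print(swim_round("a", members))
```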

  “The Byzantine Generals Problem”

  Another classic Leslie Lamport paper on consensus, “The Byzantine Generals Problem” explores how to deal with distributed actors that intentionally or accidentally submit incorrect messages:

  Reliable computer systems must handle malfunctioning components that give conflicting information to different parts of the system. This situation can be expressed abstractly in terms of a group of generals of the Byzantine army camped with their troops around an enemy city. Communicating only by messenger, the generals must agree upon a common battle plan. However, one or more of them may be traitors who will try to confuse the others. The problem is to find an algorithm to ensure that the loyal generals will reach agreement. It is shown that, using only oral messages, this problem is solvable if and only if more than two-thirds of the generals are loyal; so a single traitor can confound two loyal generals. With unforgeable written messages, the problem is solvable for any number of generals and possible traitors. Applications of the solutions to reliable computer systems are then discussed.

  The paper is mostly focused on the formal proof (a bit of a theme for Lamport, who developed TLA+17 to make formal proofs easier), but it’s also a useful reminder that we still tend to assume our components will behave reliably and honestly, and perhaps we shouldn’t!
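
  The oral-messages result is easy to state concretely: with f traitors, agreement requires at least 3f + 1 generals in total, which is the “more than two-thirds loyal” condition from the abstract. A quick check of small cases:

```
# The oral-messages bound from the paper: n generals tolerate f traitors
# only when n >= 3f + 1.

def oral_messages_solvable(n, f):
    return n >= 3 * f + 1

for n, f in [(3, 1), (4, 1), (7, 2), (10, 3)]:
    verdict = "solvable" if oral_messages_solvable(n, f) else "not solvable"
    print(f"n={n}, traitors={f}: {verdict}")
# n=3, traitors=1: not solvable   <- a single traitor confounds two loyal generals
# n=4, traitors=1: solvable
```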

  “Out of the Tar Pit”

  “Out of the Tar Pit” bemoans unnecessary complexity in software, and proposes that functional programming and better data modeling can help us reduce accidental complexity. (The authors argue that most unnecessary complexity comes from state.)

 
