Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers

Page 14

by John MacCormick

Let's get back to Shannon. His seminal 1948 paper, among its many extraordinary contributions, included a description of one of the earliest compression techniques. An MIT professor, Robert Fano, had also discovered the technique at about the same time, and the approach is now known as Shannon-Fano coding. In fact, Shannon-Fano coding is a particular way of implementing the shorter-symbol trick described earlier in this chapter. As we shall soon see, Shannon-Fano coding was rapidly superseded by another algorithm, but the method is very effective and survives to this day as one of the optional compression methods in the ZIP file format.

Both Shannon and Fano were aware that although their approach was both practical and efficient, it was not the best possible: Shannon had proved mathematically that even better compression techniques must exist, but had not yet discovered how to achieve them. Meanwhile, Fano had started teaching a graduate course in information theory at MIT, and he posed the problem of achieving optimal compression as one of the options for a term paper in the course. Remarkably, one of his students solved the problem, producing a method that yields the best possible compression for each individual symbol. The student was David Huffman, and his technique—now known as Huffman coding—is another example of the shorter-symbol trick. Huffman coding remains a fundamental compression algorithm and is widely used in communication and data storage systems.

* * *

1The solution: the letter A repeated 251 times.

8

Databases: The Quest for Consistency

“Data! data! data!” he cried impatiently. “I can't make bricks without clay.”

—SHERLOCK HOLMES IN ARTHUR CONAN DOYLE'S The Adventure of the Copper Beeches

Consider the following arcane ritual. A person takes from a desk a specially printed pad of paper (known as a checkbook), writes some numbers on it, and adds a signature with a flourish. The person then tears the top sheet from the pad, puts it in an envelope, and sticks another piece of paper (known as a stamp) on the front of the envelope. Finally, the person carries the envelope outside and down the street, to a large box where the envelope is deposited.

Until the turn of the 21st century, this was the monthly ritual of anyone paying a bill: phone bills, electric bills, credit card bills, and so on. Since then, systems of online bill payment and online banking have evolved. The simplicity and convenience of these systems makes the previous paper-based method seem almost ludicrously laborious and inefficient by comparison.

What technologies have enabled this transformation? The most obvious answer is the arrival of the internet, without which online communication of any form would be impossible. Another crucial technology is public key cryptography, which we already discussed in chapter 4. Without public key crypto, sensitive financial information could not be securely transmitted over the internet. There is, however, at least one other technology that is essential for online transactions: the database. As computer users, most of us are blissfully unaware of it, but virtually all of our online transactions are processed using sophisticated database techniques, developed by computer scientists since the 1970s.

Databases address two major issues in transaction processing: efficiency and reliability. Databases provide efficiency through algorithms that permit thousands of customers to simultaneously conduct transactions without leading to any conflicts or inconsistencies. And databases provide reliability through algorithms that allow data to survive intact despite the failure of computer components like disk drives, which would usually lead to severe data loss. Online banking is a canonical example of an application that requires outstanding efficiency (to serve many customers at once without producing any errors or inconsistencies) and essentially perfect reliability. So to focus our discussions, we will often return to the example of online banking.

In this chapter, we will learn about three of the fundamental—and beautiful—ideas behind databases: write-ahead logging, two-phase commit, and relational databases. These ideas have led to the absolute dominance of database technology for storing certain types of important information. As usual, we'll try to focus on the core insight behind each of these ideas, identifying a single trick that makes it work. Write-ahead logging boils down to the “to-do list trick,” which is tackled first. Then we move on to the two-phase commit protocol, described here via the simple but powerful “prepare-then-commit trick.” Finally, we will take a peek into the world of relational databases by learning about the “virtual table trick.”

But before learning any of these tricks, let's try to clear up the mystery of what a database actually is. In fact, even in the technical computer science literature, the word “database” can mean a lot of different things, so it is impossible to give a single, correct definition. But most experts would agree that the key property of databases, the one that distinguishes them from other ways of storing information, is that the information in a database has a predefined structure.

To understand what “structure” means here, let's first look at its opposite—an example of unstructured information:

Rosina is 35, and she's friends with Matt, who is 26. Jingyi is 37 and Sudeep is 31. Matt, Jingyi, and Sudeep are all friends with each other.

This is exactly the type of information that a social networking site, like Facebook or MySpace, would need to store about its members. But, of course, the information would not be stored in this unstructured way. Here's the same information in a structured form:

Computer scientists call this type of structure a table. Each row of the table contains information about a single thing (in this case, a person). Each column of the table contains a particular type of information, such as a person's age or name. A database often consists of many tables, but our initial examples will keep things simple and use only a single table.

Obviously, it is vastly more efficient for humans and computers alike to manipulate data in the structured form of a table, rather than the unstructured free text in the example above. But databases have much more going for them than mere ease of use.

Our journey into the world of databases begins with a new concept: consistency. As we will soon discover, database practitioners are obsessed with consistency—and with good reason. In simple terms, “consistency” means that the information in the database doesn't contradict itself. If there is a contradiction in the database, we have the worst nightmare of the database administrator: inconsistency. But how could an inconsistency arise in the first place? Well, imagine that the first two rows in the table above were changed slightly, giving:

Can you spot the problem here? According to the first row, Rosina is friends with Jingyi. But according to the second row, Jingyi is not friends with Rosina. This violates the basic notion of friendship, which is that two people are simultaneously friends with each other. Admittedly, this is a rather benign example of inconsistency.

To imagine a more serious case, suppose that the concept of “friendship” is replaced with “marriage.” Then we would end up with A married to B, but B married to C—a situation that is actually illegal in many countries.

Actually, this type of inconsistency is easy to avoid when new data is added to the database. Computers are great at following rules, so it's easy to set up a database to follow the rule “If A is married to B, then B must be married to A.” If someone tries to enter a new row that violates this rule, they will receive an error message and the entry will fail. So ensuring consistency based on simple rules doesn't require any clever trick.

But there are other types of inconsistency that require much more ingenious solutions. We'll look at one of these next.

TRANSACTIONS AND THE TO-DO LIST TRICK

Transactions are probably the most important idea in the world of databases. But to understand what they are, and why they are necessary, we need to accept two facts about computers. The first fact is one that you are probably all too familiar with: computer programs crash—and when a program crashes, it forgets everything it was doing. Only information that w
as explicitly saved to the computer's file system is preserved. The second fact we need to know is rather obscure, but extremely important: computer storage devices, such as hard drives and flash memory sticks, can write only a small amount of data instantaneously—typically about 500 characters. (If you're interested in technical jargon, I'm referring here to the sector size of a hard disk, which is typically 512 bytes. With flash memory, the relevant quantity is the page size, which may also be hundreds or thousands of bytes.) As computer users, we never notice this small size limit for instantaneously storing data on a device, because modern drives can execute hundreds of thousands of these 500-character writes every second. But the fact remains that the disk's contents get changed only a few hundred characters at a time.

What on earth does this have to do with databases? It has an extremely important consequence: typically, the computer can update only a single row of a database at any one time. Unfortunately, the very small and simple example above doesn't really demonstrate this. The entire table above contains less than 200 characters, so in this particular case, it would be possible for the computer to update two rows at once. But in general, for a database of any reasonable size, altering two different rows does require two separate disk operations.

With these background facts established, we can get to the heart of the matter. It turns out that many seemingly simple changes to a database require two or more rows to be altered. And as we now know, altering two different rows cannot be achieved in a single disk operation, so the database update will result in some sequence of two or more disk operations. But the computer can crash at any time. What will happen if the computer crashes between two of these disk operations? The computer can be rebooted, but it will have forgotten about any operations it was planning to perform, so it's possible that some of the necessary changes were never made. In other words, the database might be left in an inconsistent state!

At this stage, the whole problem of inconsistency after a crash might seem rather academic, so we'll look at two examples of this extremely important problem. Let's start with an even simpler database than the one above, say:

This very dull and depressing database lists three lonely people. Now suppose Rosina and Jingyi become friends, and we would like to update the database to reflect this happy event. As you can see, this update will require changes to both the first and second rows of the table—and as we discussed earlier, this will generally require two separate disk operations. Let's suppose that row 1 happens to get updated first. Immediately after that update, and before the computer has had a chance to execute the second disk operation that will update row 2, the database will look like this:

So far, so good. Now the database program just needs to update row 2, and it will be done. But wait: what if the computer crashes before it gets a chance to do that? Then after the computer has restarted, it will have no idea that row 2 still needs to be updated. The database will be left exactly as printed above: Rosina is friends with Jingyi, but Jingyi is not friends with Rosina. This is the dreaded inconsistency.

I already mentioned that database practitioners are obsessed with consistency, but at this point it may not seem like such a big deal. After all, does it really matter if Jingyi is recorded as being a friend in one place and friendless in another place? We could even imagine an automated tool that scans through the database every so often, looking for discrepancies like this and fixing them. In fact, tools like this do exist and can be used in databases where consistency is of secondary importance. You may have even encountered an example of this yourself, because some operating systems, when rebooted after a crash, check the entire file system for inconsistencies.

But there do exist situations in which an inconsistency is genuinely harmful and cannot be corrected by an automated tool. A classic example is the case of transferring money between bank accounts. Here's another simple database:

Suppose Zadie has requested to transfer $200 from her checking account to her savings account. Just as in the previous example, this is going to require two rows to be updated, using a sequence of two separate disk operations. First, Zadie's checking balance will be reduced to $600, then her savings balance will be increased to $500. And if we are unlucky enough to experience a crash between these two operations, the database will look like this:

In other words, this is a complete disaster for Zadie: before the crash, Zadie had a total of $1100 in her two accounts, but now she has only $900. She never withdrew any money—but somehow, $200 has completely vanished! And there is no way to detect this, because the database is perfectly self-consistent after the crash. We have encountered a much more subtle type of inconsistency here: the new database is inconsistent with its state before the crash.

It's worth investigating this important point in more detail. In our first example of inconsistency, we ended up with a database that was self-evidently inconsistent: A friends with B, but B not friends with A. This type of inconsistency can be detected merely by examining the database (although the detection process could be very time-consuming, if the database contains millions—or even billions—of records). In our second example of inconsistency, the database was left in a state that is perfectly plausible, when considered as a snapshot taken at a particular time. There is no rule that states what the balances of the accounts must be, or any relationships between those balances. Nevertheless, we can observe inconsistent behavior if we examine the state of the database over time. Three facts are pertinent here: (i) before initiating her transfer, Zadie had $1100; (ii) after the crash, she had $900; (iii) in the intervening period, she did not withdraw any money. Taken together, these three facts are inconsistent, but the inconsistency cannot be detected by examining the database at a particular point in time.

To avoid both types of inconsistency, database researchers came up with the concept of a “transaction”—a set of changes to a database that must all take place if the database is to be left consistent. If some, but not all, of the changes in a transaction are performed, then the database might be left inconsistent. This is a simple but extremely powerful idea. A database programmer can issue a command like “begin transaction,” then make a bunch of interdependent changes to the database, and finish with “end transaction.” The database will guarantee that the programmer's changes will all be accomplished, even if the computer running the database crashes and restarts in the middle of the transaction.

To be absolutely correct, we should be aware that there is another possibility too: it's possible that after a crash and restart, the database will return to the exact state it was in before the transaction began. But if this happens, the programmer will receive a notification that the transaction failed and must be resubmitted—so no harm is done. We'll be discussing this possibility in greater detail later, in the section about “rolling back” transactions. But for now, the crucial point is that the database remains consistent regardless of whether a transaction is completed or rolled back.

From the description so far, it may seem that we are obsessing unnecessarily over the possibility of crashes, which are, after all, very rare on modern operating systems running modern application programs. There are two responses to this. First, the notion of “crash” as it applies here is rather general: it encompasses any incident that might cause the computer to stop functioning and thus lose data. The possibilities include power failure, disk failure, other hardware malfunctions, and bugs in the operating system or application programs. Second, even if these generalized crashes are rather rare, some databases cannot afford to take the risk: banks, insurance companies, and any other organization whose data represents actual money cannot afford inconsistency in their records, under any circumstances.

The simplicity of the solution described above (begin a transaction, perform as many operations as necessary, then end the transaction) might sound too good to be true. In fact, it can be achieved with the relatively simple “to-do list” trick described next.

The To-Do List Trickr />
Not all of us are lucky enough to be well organized. But whether or not we are well organized ourselves, we've all seen one of the great weapons wielded by highly organized people: the “to-do” list. Perhaps you are not a fan of making lists yourself, but it's hard to argue with their usefulness. If you have 10 errands to get done in one day, then writing them down—preferably in an efficient ordering—makes for a very good start. A to-do list is especially useful if you get distracted (or, shall we say, “crash”?) in the middle of the day. If you forget your remaining errands for any reason, a quick glance at the list will remind you of them.

Database transactions are achieved using a special kind of to-do list. That's why we'll call it the “to-do list” trick, although computer scientists use the term “write-ahead logging” for the same idea. The basic idea is to maintain a log of actions the database is planning to take. The log is stored on a hard drive or some other permanent storage, so information in the log will survive crashes and restarts. Before any of the actions in a given transaction are performed, they are all recorded in the log and thus saved to the disk. If the transaction completes successfully, we can save some space by deleting the transaction's to-do list from the log. So Zadie's money-transfer transaction described above would take place in two main steps. First, the database table is left untouched and we write the transaction's to-do list in the log:

After ensuring the log entries have been saved to some permanent storage such as a disk, we make the planned changes to the table itself:

Assuming the changes have been saved to disk, the log entries can now be deleted.

But that was the easy case. What if the computer crashes unexpectedly in the middle of the transaction? As before, let's assume the crash occurs after Zadie's checking account has been debited, but before her savings account is credited. The computer reboots and the database restarts, finding the following information on the hard drive:

‹ Prev Next ›