First, let's look at how the data might be stored in the simple, one-table approach we've been using so far in this chapter. This is shown in the top panel of the figure on the following page. As you can see, there are 10 rows in this database and 5 columns; a simple way of measuring the amount of information in the database is to say there are 10 × 5 = 50 data items in the database. Spend a few seconds now studying the top panel of the figure on the next page more closely. Is there anything that irks you about the way this data is stored? For instance, can you see any unnecessary repetition of data? Can you think of a more efficient way of storing the same information?
You probably realized that a lot of information about each course is duplicated for each student that takes the course. For example, three students take ARCH101, and the detailed information about this course (including its title, instructor, and room number) is repeated for each of the three students. A much more effective way of storing this information is to use two tables: one to store which courses are taken by which students, and another to store the details about each course. This two-table approach is shown in the bottom panel of the figure on the following page.
We can immediately see one of the advantages of this multitable approach: the total amount of storage required is reduced. This new approach uses one table with 10 rows and 2 columns (i.e., 10 × 2 = 20 items), and a second table with 3 rows and 4 columns (i.e., 3 × 4 = 12 items), resulting in a total of 32 items. In contrast, the one-table approach needed 50 items to store exactly the same information.
How did this saving come about? It comes from the elimination of repeated information: instead of repeating the course title, instructor, and room number for each course taken by each student, this information is listed exactly once for each course. We have sacrificed something to achieve this, though: now the course numbers appear in two different places, since there is a “course number” column in both tables. So we have traded a large amount of repetition (of the course details) for a small amount of repetition (of the course numbers). Overall, this works out to be a good deal. The gains in this small example are not huge, but you can probably see that if there are hundreds of students taking each course, the storage savings from this approach would be enormous.
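To make the arithmetic concrete, here is a small Python sketch of the item counts for both layouts. The function names are invented for this illustration, and the larger case (200 students in each of 3 courses) is a hypothetical scenario rather than data from the figure.

```python
def single_table_items(enrollments: int, columns: int = 5) -> int:
    """Items stored when every enrollment row repeats all the course details."""
    return enrollments * columns


def two_table_items(enrollments: int, courses: int) -> int:
    """Items stored with a 2-column enrollment table and a 4-column course table."""
    return enrollments * 2 + courses * 4


# The chapter's example: 10 enrollments spread over 3 courses.
small_single = single_table_items(10)   # 10 * 5 = 50
small_split = two_table_items(10, 3)    # 10 * 2 + 3 * 4 = 32

# Hypothetical larger case: 3 courses with 200 students each (600 enrollments).
large_single = single_table_items(600)  # 600 * 5 = 3000
large_split = two_table_items(600, 3)   # 600 * 2 + 3 * 4 = 1212

print(small_single, small_split, large_single, large_split)
```

In the small example the two-table layout saves 18 items; in the hypothetical larger one it saves well over half the storage, because the course details are still listed only three times.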
Top: Single-table database for students' courses.
Bottom: The same data stored more efficiently, in two tables.
There is another big advantage of the multitable approach. If the tables are designed correctly, then changes to the database can be made more easily. For example, suppose the room number for MATH314 has changed from 560 to 440. In the one-table approach (top of the figure on the previous page), four separate rows would need to be updated—and, as we discussed earlier, these four updates would need to be wrapped in a single transaction to ensure that the database remains consistent. But in the multitable approach (bottom of the figure on the facing page), only one change is required, updating a single entry in the table of course details.
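The difference between the two kinds of update can be sketched in a few lines of Python. The student names below are placeholders invented for this illustration; only the course numbers and the room numbers 560, 440, and 851 come from the text, and plain Python lists and dictionaries stand in for real database tables.

```python
# One-table layout: the room number is repeated in every MATH314 row.
# Student names are placeholders, not the names in the figure.
single_table = [
    {"student": "Student A", "course": "MATH314", "room": 560},
    {"student": "Student B", "course": "MATH314", "room": 560},
    {"student": "Student C", "course": "MATH314", "room": 560},
    {"student": "Student D", "course": "MATH314", "room": 560},
    {"student": "Luigi", "course": "HIST256", "room": 851},
]

rows_touched = 0
for row in single_table:
    if row["course"] == "MATH314":
        row["room"] = 440  # every matching row must be changed
        rows_touched += 1

# Two-table layout: the room number lives in exactly one place.
course_details = {"MATH314": {"room": 560}, "HIST256": {"room": 851}}
course_details["MATH314"]["room"] = 440  # a single write

print(rows_touched)  # 4
```

If the loop over the single table were interrupted halfway, some rows would say 560 and others 440, which is exactly the inconsistency a transaction is meant to prevent.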
Keys
It's worth pointing out here that, while this simple student-courses example is most efficiently represented using only two tables, real databases often incorporate many tables. It is easy to imagine extending our student-courses example with new tables. For example, there could be a table containing details for each student, such as a student ID number, phone number, and home address. There could be a table for each instructor, listing e-mail address, office location, and office hours. Each table is designed so that most of its columns store data that is not repeated anywhere else—the idea is that whenever details about a certain object are required, we can “look up” those details in the relevant table.
In database terminology, any column that is used to “look up” details in a table is called a key. For example, let's think about how we would find out the room number for Luigi's history class. Using the single-table approach of the upper panel of the figure on the previous page, we just scan the rows until we find Luigi's history class, look across to the room number column, and observe the answer, which in this case is 851. But in the multitable approach of the same figure's lower panel, we initially scan the first table to find the course number of Luigi's history class—this turns out to be “HIST256.” Then we use “HIST256” as a key in the other table: we look up the details for this course by finding the row containing “HIST256” as its course number, then move across that row to find the room number (again, 851). This process is shown in the figure on the following page.
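The two-step lookup just described might be sketched as follows. This is only an illustration: the function name `room_for` is invented here, Luigi's second enrollment is a hypothetical extra row, and a Python dictionary plays the role of the course table with the course number as its key.

```python
# First table: which student takes which course.
enrollments = [
    ("Luigi", "HIST256"),
    ("Luigi", "MATH314"),  # hypothetical extra row, to make the scan non-trivial
]

# Second table: course details, looked up by course number (the key).
course_rooms = {"HIST256": 851, "MATH314": 560}


def room_for(student: str, subject_prefix: str) -> int:
    """Scan the first table for the course number, then use it as a key."""
    for name, course_number in enrollments:
        if name == student and course_number.startswith(subject_prefix):
            return course_rooms[course_number]  # the key lookup
    raise LookupError(f"no {subject_prefix} course found for {student}")


print(room_for("Luigi", "HIST"))  # 851
```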
The beauty of using keys like this is that databases can look up keys with superb efficiency. This is done in a similar fashion to the way a human looks up a word in a dictionary. Think about how you would go about finding the word “epistemology” in a printed dictionary. Naturally, you would not start at the first page and scan through every entry looking for “epistemology.” Instead, you quickly narrow in on the word by looking at the page headings, initially turning the pages in large clumps and gradually reverting to smaller clumps as you get close to your goal. Databases look up keys using the same technique, but they are even more efficient than humans. This is because the database can precalculate the “clumps” of pages that will be turned and keep a record of the headings at the start and end of each clump. A set of precalculated clumps for fast key lookup is known in computer science as a B-tree. The B-tree is yet another crucial and ingenious idea underpinning modern databases, but a detailed discussion of B-trees would, unfortunately, lead us too far afield.
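A full B-tree is beyond our scope, but the dictionary-style narrowing can be illustrated with a simpler relative: binary search over a sorted list of keys. The sketch below is a stand-in for the idea, not a B-tree implementation, and the 100,000 course numbers are made up for the purpose.

```python
def binary_search(sorted_keys, target):
    """Return (index, steps) for target, or (-1, steps) if it is absent."""
    lo, hi = 0, len(sorted_keys) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2  # open the "dictionary" in the middle
        if sorted_keys[mid] == target:
            return mid, steps
        if sorted_keys[mid] < target:
            lo = mid + 1  # the target lies in the later pages
        else:
            hi = mid - 1  # the target lies in the earlier pages
    return -1, steps


# Hypothetical keys, zero-padded so alphabetical order matches numeric order.
keys = [f"COURSE{n:05d}" for n in range(100_000)]

index, steps = binary_search(keys, "COURSE73125")
print(index, steps)
```

Each step halves the range still under consideration, so even with 100,000 keys the search needs at most 17 comparisons, compared with up to 100,000 for a page-by-page scan.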
Looking up data using a key: To find out the room number for Luigi's history course, we first find the relevant course number in the left-hand table. This value, “HIST256,” is then used as a key in the other table. Because the column of course numbers is sorted in alphabetical order, we can find the correct row very quickly, then obtain the corresponding room number (851).
The Virtual Table Trick
We are nearly ready to appreciate the main ingenious trick behind modern multitable databases. The basic idea is simple: although all of a database's information is stored in a fixed set of tables, a database can generate completely new, temporary tables whenever it needs to. We'll call these “virtual tables” to emphasize the fact that they are never really stored anywhere—the database creates them whenever they are needed to answer a query to the database and then immediately deletes them.
A simple example will demonstrate the virtual table trick. Suppose we start with the database of the lower panel of the figure on page 142, and a user enters a query asking for the names of all students taking classes from Professor Kirby. There are actually several different ways a database can proceed with this query; we'll just examine one of the possible approaches. The first step is to create a new virtual table listing students and instructors for all courses. This is done using a special database operation called a join of two tables. The basic idea is to combine each row of one table with each corresponding row of the other table, where the correspondence is established by a key column that appears in both tables. For example, when we join the two tables of the bottom panel of the figure on page 142 using the “course number” column as the key, the result is a virtual table exactly like the one in the figure's top panel—each student name is combined with all of the details for the relevant course from the second table, and these details are looked up using the “course number” as a key. Of course, the original query was about student names and instructors, so we don't need any of the other columns. Luckily, databases include a projection operation that lets us throw away columns we are not interested in. So after the join operation to combine the two tables, followed by a projection operation to eliminate some unnecessary columns, the database produces the following virtual table:
Next, the database uses another important operation called select. A select operation chooses some of the rows from a table, based on some criteria, and throws away the other rows, producing a new virtual table. In this case, we are looking for students who take courses from Professor Kirby, so we need to do a “select” operation that chooses only rows in which the instructor is “Prof Kirby.” That leaves us with this virtual table:
The query is nearly completed. All we need now is another projection operation, to throw away the “instructor” column, leaving us with a virtual table that answers the original query:
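The whole sequence of join, project, select, and project can be sketched in Python, with lists of dictionaries standing in for tables. Several things here are assumptions made for illustration: the student names other than Luigi, the course titles, the instructors other than Prof Kirby, the ARCH101 room number, and the choice of MATH314 as Kirby's course. A real database would carry out these steps very differently internally.

```python
# Hypothetical sample data; see the assumptions listed above.
enrollments = [
    {"student": "Luigi", "course": "HIST256"},
    {"student": "Luigi", "course": "MATH314"},
    {"student": "Student A", "course": "MATH314"},
    {"student": "Student B", "course": "ARCH101"},
]
courses = [
    {"course": "ARCH101", "title": "(title 1)", "instructor": "Prof X", "room": 100},
    {"course": "HIST256", "title": "(title 2)", "instructor": "Prof Y", "room": 851},
    {"course": "MATH314", "title": "(title 3)", "instructor": "Prof Kirby", "room": 560},
]


def join(left, right, key):
    """Combine each left row with the right row that shares its key value."""
    index = {row[key]: row for row in right}  # assumes the key is unique in `right`
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


def project(table, columns):
    """Keep only the named columns (this sketch does not remove duplicate rows)."""
    return [{col: row[col] for col in columns} for row in table]


def select(table, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in table if predicate(row)]


joined = join(enrollments, courses, "course")
students_and_instructors = project(joined, ["student", "instructor"])
kirby_rows = select(students_and_instructors, lambda r: r["instructor"] == "Prof Kirby")
answer = project(kirby_rows, ["student"])

print(answer)  # [{'student': 'Luigi'}, {'student': 'Student A'}]
```

None of the intermediate lists needs to be kept once `answer` has been computed, which is the sense in which the tables are "virtual."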
It's worth adding a slightly more technical note here. If you happen to be familiar with the database query language SQL, you might find the above definition of the “select” operation rather strange, as the “select” command in SQL does much more than merely selecting some rows. The terminology here comes from a mathematical theory of database operations, known as relational algebra, in which “select” is used only for selecting rows. Relational algebra also includes the “join” and “project” operations that we used in our query to find Professor Kirby's students.
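For readers who do know SQL, the same query might look like the sketch below, run against an in-memory SQLite database using Python's standard `sqlite3` module. The table and column names, and all of the sample rows except the facts mentioned in the text, are assumptions made for this illustration. Notice how a single SQL `SELECT` statement covers all three relational algebra operations: the `JOIN` clause performs the join, the `WHERE` clause performs the relational "select," and the column list after `SELECT` performs the projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE enrollments (student TEXT, course_number TEXT);
    CREATE TABLE courses (
        course_number TEXT PRIMARY KEY,
        title TEXT,
        instructor TEXT,
        room INTEGER
    );
    """
)

# Hypothetical rows: names, titles, and Kirby's course are placeholders.
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?)",
    [
        ("Luigi", "HIST256"),
        ("Luigi", "MATH314"),
        ("Student A", "MATH314"),
        ("Student B", "ARCH101"),
    ],
)
conn.executemany(
    "INSERT INTO courses VALUES (?, ?, ?, ?)",
    [
        ("ARCH101", "(title 1)", "Prof X", 100),
        ("HIST256", "(title 2)", "Prof Y", 851),
        ("MATH314", "(title 3)", "Prof Kirby", 560),
    ],
)

rows = conn.execute(
    """
    SELECT e.student
    FROM enrollments AS e
    JOIN courses AS c ON c.course_number = e.course_number
    WHERE c.instructor = ?
    ORDER BY e.student
    """,
    ("Prof Kirby",),
).fetchall()

kirby_students = [name for (name,) in rows]
conn.close()

print(kirby_students)  # ['Luigi', 'Student A']
```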
Relational Databases
A database that stores all of its data in interconnected tables such as the ones we have been using is called a relational database. Relational databases were advocated by the IBM researcher E. F. Codd in his extraordinarily influential 1970 paper, “A Relational Model of Data for Large Shared Data Banks.” Like many of the greatest ideas in science, relational databases seem simple in retrospect—but at the time, they represented a huge leap forward in the efficient storage and processing of information. It turns out that a mere handful of operations (such as the relational algebra operations “select,” “join,” and “project” we saw earlier) are sufficient to generate virtual tables that answer essentially any query to a relational database. So a relational database can store its data in tables that are structured for efficiency, and use the virtual table trick to answer queries that seemingly require the data to be in a different form.
That's why relational databases are used to support a large proportion of e-commerce activities. Whenever you buy something online, you are probably interacting with a slew of relational database tables storing information about products, customers, and individual purchases. In cyberspace, we are constantly surrounded by relational databases, often without even realizing it.
THE HUMAN SIDE OF DATABASES
To the casual observer, databases may well be the least exciting topic in this book. It's just hard to get excited about data storage. But under the covers, the ingenious ideas that make databases work tell a different story. Built out of hardware that can fail in the middle of any operation, databases nevertheless give us the efficiency and rock-solid dependability that we have come to expect from online banking and similar activities. The to-do list trick gives us atomic transactions, which enforce consistency even when thousands of customers are simultaneously interacting with a database. This immense level of concurrency, together with rapid query responses via the virtual table trick, makes large databases efficient. The to-do list trick also guarantees consistency in the face of failures. When combined with the prepare-then-commit trick for replicated databases, we are left with iron-clad consistency and durability for our data.
The heroic triumph of databases over unreliable components, known by computer scientists as “fault-tolerance,” is the work of many researchers over many decades. But among the most important contributors was Jim Gray, a superb computer scientist who literally wrote the book on transaction processing. (The book is Transaction Processing: Concepts and Techniques, first published in 1992.) Sadly, Gray's career ended early: one day in 2007, he sailed his yacht out of San Francisco Bay, under the Golden Gate Bridge, and into the open ocean on a planned day trip to some nearby islands. No sign of Gray, or his boat, was ever seen again. In a heart-warming twist to this tragic story, Gray's many friends in the database community used his own tools in an effort to save him: freshly generated satellite imagery of the ocean near San Francisco was uploaded to a database so that friends and colleagues could search for any trace of the missing database pioneer. Unfortunately, the search was not successful, and the world of computer science was left without one of its leading luminaries.
9
Digital Signatures: Who Really Wrote This Software?
To show you how mistaken you are, and what an unfounded assumption yours is, I will lay before you a certificate…look at it! You may take it in your hand; it's no forgery.
—CHARLES DICKENS, A Tale of Two Cities
Of all the ideas we'll encounter in this book, the concept of a “digital signature” is perhaps the most paradoxical. The word “digital,” interpreted literally, means “consisting of a string of digits.” So, by definition, anything that is digital can be copied: to do so, just copy the digits one at a time. If you can read it, you can copy it! On the other hand, the whole point of a “signature” is that it can be read, but can't be copied (that is, forged) by anyone other than its author. How could it be possible to create a signature that is digital, yet can't be copied? In this chapter, we will discover the resolution of this intriguing paradox.
WHAT ARE DIGITAL SIGNATURES REALLY USED FOR?
It might seem unnecessary to ask the question: what are digital signatures used for? Surely, you might think, we can use them for the same kinds of things that paper signatures are used for: signing checks and other legal documents, such as the lease on an apartment. But if you think about it for a moment, you will realize that this isn't true. Whenever you make an online payment for something, whether by credit card or through an online banking system, do you provide any kind of signature? The answer is no. Typically, online credit card payments require no signature whatsoever. Online banking systems are a little different, because they require you to log in with a password that helps to verify your identity. But if you later make a payment during your online banking session, no signature of any kind is required.
Your computer checks digital signatures automatically. Top: The message my web browser displays when I attempt to download and run a program that has a valid digital signature. Bottom: The result of an invalid or missing digital signature.
What, then, are digital signatures used for in practice? The answer is the reverse of what you might first think: instead of you signing material that is sent to others, it is typically others who sign material before sending it to you. The reason you are probably not aware of this is that the digital signatures are verified automatically by your computer. For example, whenever you attempt to download and run a program, your web browser probably checks to see if the program has a digital signature and whether or not the signature is valid. Then it can display an appropriate warning, like the ones above.
As you can see, there are two possibilities. If the software has a valid signature (as in the top panel of the figure), the computer can tell you with complete confidence the name of the company that wrote the software. Of course, this doesn't guarantee that the software is safe, but at least you can make an informed decision based on the amount of trust you have in the company. On the other hand, if the signature is invalid or missing (as in the bottom panel of the figure), you have absolutely no reassurance about where the software really came from. Even if you thought you were downloading software from a reputable company, it's possible that a hacker somehow substituted some malicious software for the real thing. Alternatively, maybe the software was produced by an amateur who did not have the time or motivation to create a valid digital signature. It is up to you, the user, to decide whether you trust the software under these circumstances.
Although software-signing is the most obvious application of digital signatures, it is by no means the only one. In fact, your computer receives and verifies digital signatures surprisingly often, because some frequently used internet protocols employ digital signatures to verify the identity of the computers you are interacting with. For example, secure servers whose web addresses begin with “https” typically send your computer a digitally signed certificate before establishing a secure session. Digital signatures are also used to verify the authenticity of many software components, such as browser plugins. You have probably seen warning messages about such things while surfing the web.
There is another type of online signature you may have encountered: some websites ask you to type your name as a signature in an online form. I sometimes have to do this when filling out an online recommendation letter for one of my students, for instance. This is not what a computer scientist means by a digital signature! Obviously, this kind of typed signature can be forged effortlessly, by anyone who knows your name. In this chapter, we will learn how to create a digital signature that cannot be forged.
PAPER SIGNATURES
Our explanation of digital signatures is going to be built up gradually, starting with the familiar situation of paper signatures and moving in small steps toward genuine digital signatures. So to start with, let's go back to a world with no computers at all. In this world, the only way to authenticate documents is with handwritten signatures on paper. Notice that in this scenario, a signed document can't be authenticated in isolation. For example, suppose you find a piece of paper that says “I promise to pay $100 to Francoise. Signed, Ravi”—just as shown above. How can you verify that Ravi really signed this document? The answer is that you need some trusted repository of signatures, where you can go and check that Ravi's signature is genuine. In the real world, institutions such as banks and government departments perform this role—they really do keep files storing the signatures of their customers, and these files can be physically checked if necessary. In our pretend scenario, let's imagine that a trusted institution called a “paper signature bank” keeps everyone's signature on file. A schematic example of a paper signature bank is shown above.
Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers Page 16