5
DATAFICATION
MATTHEW FONTAINE MAURY was a promising U.S. Navy officer headed to a new assignment on the brig Consort in 1839, when his stagecoach suddenly slid off its path, toppled over, and hurled him into the air. He landed hard, fracturing his thighbone and dislocating his knee. The joint was snapped back into place by a local doctor, but the thigh was badly set and had to be rebroken a few days later. The injuries left Maury, at 33 years old, partially crippled and unfit for the sea. After nearly three years of recuperation, the Navy placed him behind a desk, as the head of the uninspiringly named Depot of Charts and Instruments.
It turned out to be the perfect place for him. As a young navigator, Maury had been bewildered by the way ships would zigzag across the water rather than take more direct routes. When he quizzed captains about it, they replied it was far better to steer a familiar course than to risk a less known one that might entail hidden dangers. They viewed the ocean as an unpredictable realm, where sailors faced the unexpected with every wind and wave.
Yet from his voyages Maury knew that this wasn’t entirely true. He saw patterns everywhere. On an extended stop in Valparaiso, Chile, he witnessed the winds operating like clockwork. A late afternoon gale would abruptly end at sundown and become a gentle breeze, as if someone had turned off a tap. On another voyage he crossed the warm aqua-blue waters of the Gulf Stream as it flowed between the dark walls of Atlantic seawater, as distinguishable and fixed in place as if it were the Mississippi River. Indeed, the Portuguese had navigated the Atlantic for centuries by relying on uniform easterly and westerly winds called the “trades” (which in old English meant “path” or “track,” and only later became associated with commerce).
Whenever Midshipman Maury arrived at a new port, he would seek out old sea captains to gain their knowledge, based on the experiences passed down for generations. He learned about tides, winds, and sea currents that acted in regularity—but were nowhere to be found in the books and maps that the Navy issued to its sailors. Instead, they relied on charts that were sometimes a hundred years old, many with vast omissions or outright inaccuracies. In his new position as the Superintendent of the Depot of Charts and Instruments, he aimed to fix that.
Taking up the post, he inventoried the barometers, compasses, sextants, and chronometers in the depot’s collection. He also noted the myriad nautical books, maps, and charts that it housed. He found musty crates full of old logbooks from all the past voyages of Navy captains. His predecessors in the job had regarded them as rubbish. With the odd limerick or sketch in the margins, they sometimes seemed more of an escape from the boredom of the passage than a record of ships’ whereabouts.
But as Maury dusted off the saltwater-stained books and peered inside, he became very excited. Here was the information he needed: records about the wind, water, and weather at specific locations on specific dates. Though some of the logs offered little of value, many teemed with useful information. Put it all together, Maury realized, and an entirely new form of navigational chart would be possible. Maury and his dozen “computers”—the job title of those who calculated data—began the laborious process of extracting and tabulating the information trapped inside the deteriorating logs.
Maury aggregated the data and divvied up the entire Atlantic into blocks of five degrees of longitude and latitude. For each segment he noted the temperature, the speed and direction of the winds and waves, and also the month, since those conditions differ depending on the time of year. When combined, the data revealed patterns and pointed toward more efficient routes.
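The tabulation described here can be sketched in modern terms. What follows is a minimal illustration, not a reconstruction of Maury's actual procedure: the field names (`lat`, `lon`, `month`, `temp_c`, `wind_knots`) and the sample log entries are invented for the example, and only temperature and wind speed are averaged, since averaging compass directions would call for a circular mean.

```python
import math
from collections import defaultdict
from statistics import mean

def cell(lat, lon, size=5):
    """Return the south-west corner of the size-degree block containing a point."""
    return (math.floor(lat / size) * size, math.floor(lon / size) * size)

def aggregate(observations, size=5):
    """Group observations by (block, month) and average the readings in each group."""
    groups = defaultdict(list)
    for obs in observations:
        key = (cell(obs["lat"], obs["lon"], size), obs["month"])
        groups[key].append(obs)
    summary = {}
    for key, rows in groups.items():
        summary[key] = {
            "count": len(rows),
            "mean_temp_c": mean(r["temp_c"] for r in rows),
            "mean_wind_knots": mean(r["wind_knots"] for r in rows),
        }
    return summary

# Hypothetical log entries, made up purely for illustration.
logs = [
    {"lat": 12.4, "lon": -33.7, "month": 3, "temp_c": 26.0, "wind_knots": 14.0},
    {"lat": 14.9, "lon": -31.2, "month": 3, "temp_c": 27.0, "wind_knots": 16.0},
    {"lat": -2.1, "lon": -28.0, "month": 7, "temp_c": 25.0, "wind_knots": 6.0},
]

table = aggregate(logs)
for (block, month), stats in sorted(table.items()):
    print(block, month, stats)
```

The first two entries fall in the same five-degree block in the same month, so they collapse into a single row of the table. That collapsing step is what lets thousands of scattered log lines turn into a chart.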
Generations of seafarers’ advice occasionally had sent ships directly into calms or pitted them against opposing winds and currents. On one common route, from New York to Rio de Janeiro, sailors had long tended to fight nature rather than rely on her. American skippers were taught to avoid the hazards of a straight cruise south to Rio. So their ships flitted about in a southeasterly course before swinging southwesterly after crossing the equator. The distance sailed often amounted to three complete crossings of the Atlantic. The convoluted route turned out to be nonsensical. A roughly direct shot south was fine.
To improve accuracy, Maury needed more information. He created a standard form for logging ships’ data and got all U.S. Navy vessels to use and submit it upon landing. Merchant ships desperately wanted to get hold of his charts; Maury insisted that in return they too hand over their logs (an early version of a viral social network). “Every ship that navigates the high seas,” he proclaimed, “may henceforth be regarded as a floating observatory, a temple of science.” To fine-tune the charts, he sought other data points (just as Google built upon the PageRank algorithm to include more signals). He got captains to throw bottles with notes indicating the day, position, wind, and prevailing current into the sea at regular intervals, and to retrieve any such bottles that they spotted. Many ships flew a special flag to show they were cooperating with the information exchange (presaging the link-sharing icons that appear on some web pages).
From the data, natural sea-lanes presented themselves, where the winds and currents were particularly favorable. Maury’s charts cut long voyages, usually by about a third, saving merchants a bundle. “Until I took up your work I had been traversing the ocean blindfold,” wrote one appreciative shipmaster. And even old sea dogs who rejected the newfangled charts and relied on the traditional ways or their intuition served a useful function: if their journeys took longer or met with disaster, they proved the utility of Maury’s system. By 1855, when he published his magisterial work The Physical Geography of the Sea, Maury had plotted 1.2 million data points. “Thus the young mariner instead of groping his way along until the lights of experience should come to him . . . would here find, at once, that he had already the experience of a thousand navigators to guide him,” he wrote.
His work was essential for laying the first transatlantic telegraph cable. And, after a tragic collision on the high seas, he quickly devised the system of shipping lanes that is commonplace today. He even applied his method to astronomy: when the planet Neptune was discovered in 1846, Maury had the bright idea of combing the archives for mistaken references to it as a star, which enabled its orbit to be plotted.
Maury has been largely ignored in American history books, perhaps because the Virginia native resigned from the (Union) Navy during the Civil War and served as a spy in England for the Confederacy. But years earlier, when he arrived in Europe to drum up international support for his charts, four countries knighted him and he received gold medals from another eight, including the Holy See. At the dawn of the twenty-first century, pilot charts issued by the U.S. Navy still bore his name.
Commander Maury, the “Pathfinder of the Seas,” was among the first to realize that there is a special value in a huge corpus of data that is lacking in smaller amounts—a core tenet of big data. More fundamentally, he understood that the Navy’s musty logbooks actually constituted “data” that could be extracted and tabulated. In so doing, he was one of the pioneers of datafication, of unearthing data from material that no one thought held any value. Like Oren Etzioni at Farecast, who used the airline industry’s old price information to create a lucrative business, or the engineers at Google, who applied old search queries to understand flu outbreaks, Maury took information generated for one purpose and converted it into something else.
His method, broadly similar to big-data techniques today, was astounding considering that it was done with pencil and paper. His story highlights the degree to which the use of data predates digitization. Today we tend to conflate the two, but it is important to keep them separate. To get a fuller sense of how data is being extracted from the unlikeliest of places, consider a more modern example.
Appreciating people’s posteriors is the art and science of Shigeomi Koshimizu, a professor at Japan’s Advanced Institute of Industrial Technology in Tokyo. Few would think that the way a person sits constitutes information, but it can. When a person is seated, the contours of the body, posture, and distribution of weight can all be quantified and tabulated. Koshimizu and his team of engineers convert backsides into data by measuring the pressure at 360 different points from sensors in a car seat and indexing each point on a scale from zero to 256. The result is a digital code that is unique for each individual. In a trial, the system was able to distinguish among a handful of people with 98 percent accuracy.
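To make the idea of a seat "code" concrete, here is a rough sketch of how such readings might be compared. It is an assumption-laden illustration and not Koshimizu's method, which the text does not describe: the nearest-profile matching, the Euclidean distance, the threshold of 200, and the randomly generated profiles are all invented for the example. Only the 360 points and the zero-to-256 scale come from the passage.

```python
import math
import random

POINTS = 360      # number of pressure sensors, as in the passage
MAX_LEVEL = 256   # top of the indexing scale, as in the passage

def distance(a, b):
    """Euclidean distance between two pressure profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(reading, enrolled, threshold):
    """Return the enrolled name closest to the reading, or None if nobody is close enough."""
    name, best = min(
        ((n, distance(reading, profile)) for n, profile in enrolled.items()),
        key=lambda pair: pair[1],
    )
    return name if best <= threshold else None

def random_profile(rng):
    """Synthetic stand-in for a real seat reading (illustration only)."""
    return [rng.randint(0, MAX_LEVEL) for _ in range(POINTS)]

rng = random.Random(0)
enrolled = {"driver_a": random_profile(rng), "driver_b": random_profile(rng)}

# A later reading from driver_a: the same profile with small sensor noise, kept on the scale.
reading = [min(MAX_LEVEL, max(0, v + rng.randint(-3, 3))) for v in enrolled["driver_a"]]
stranger = random_profile(rng)

print(identify(reading, enrolled, threshold=200))   # expected to match driver_a
print(identify(stranger, enrolled, threshold=200))  # expected to match nobody
```

A real system would have to cope with people shifting in their seats, which random noise of this kind does not capture. The point is only that once posture is a list of numbers, comparing two sitters becomes arithmetic.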
The research is not asinine. The technology is being developed as an anti-theft system in cars. A vehicle equipped with it would recognize when someone other than an approved driver was at the wheel and demand a password to continue driving or perhaps cut the engine. Transforming sitting positions into data creates a viable service and a potentially lucrative business. And its usefulness may go far beyond deterring auto theft. For instance, the aggregated data might reveal clues about a relationship between drivers’ posture and road safety, such as telltale shifts in position prior to accidents. The system might also be able to sense when a driver slumps slightly from fatigue and send an alert or automatically apply the brakes. And it might not only prevent a car from being stolen but identify the thief from behind (so to speak).
Professor Koshimizu took something that had never been treated as data—or even imagined to have an informational quality—and transformed it into a numerically quantified format. Likewise, Commodore Maury took material that seemed to have little use and extracted the information, turning it into eminently useful data. Doing so allowed the information to be used in a novel way and to create unique value.
The word “data” means “given” in Latin, in the sense of a “fact.” It became the title of a classic work by Euclid, in which he explains geometry from what is known or can be shown to be known. Today data refers to a description of something that allows it to be recorded, analyzed, and reorganized. There is no good term yet for the sorts of transformations produced by Commodore Maury and Professor Koshimizu. So let’s call them datafication. To datafy a phenomenon is to put it in a quantified format so it can be tabulated and analyzed.
Again, this is very different from digitization, the process of converting analog information into the zeros and ones of binary code so computers can handle it. Digitization wasn’t the first thing we did with computers. The initial era of the computer revolution was computational, as the etymology of the word suggests. We used machines to do calculations that had taken a long time by previous methods, such as missile trajectory tables, censuses, and weather forecasts. Only later came taking analog content and digitizing it. Hence when Nicholas Negroponte of the MIT Media Lab published his landmark book Being Digital in 1995, one of his big themes was the shift from atoms to bits. We largely digitized text in the 1990s. More recently, as storage capacity, processing power, and bandwidth have increased, we’ve done it with other forms of content too, like images, video, and music.
Today there is an implicit belief among technologists that big data traces its lineage to the silicon revolution. That simply is not so. Modern IT systems certainly make big data possible, but at its core the move to big data is a continuation of humankind’s ancient quest to measure, record, and analyze the world. The IT revolution is evident all around us, but the emphasis has mostly been on the T, the technology. It is time to recast our gaze to focus on the I, the information.
In order to capture quantifiable information, to datafy, we need to know how to measure and how to record what we measure. This requires the right set of tools. It also necessitates a desire to quantify and to record. Both are prerequisites of datafication, and we developed the building blocks necessary for datafication many centuries before the dawn of the digital age.
Quantifying the world
The ability to record information is one of the lines of demarcation between primitive and advanced societies. Basic counting and measurement of length and weight were among the oldest conceptual tools of early civilizations. By the third millennium B.C. the idea of recorded information had advanced significantly in the Indus Valley, Egypt, and Mesopotamia. Accuracy increased, as did the use of measurement in everyday life. The evolution of script in Mesopotamia provided a precise method of keeping track of production and business transactions. Written language enabled early civilizations to measure reality, record it, and retrieve it later. Together, measuring and recording facilitated the creation of data. They are the earliest foundations of datafication.
This made it possible to replicate human activity. Buildings, for example, could be reproduced from records of their dimensions and materials. It also permitted experimentation: an architect or a builder could alter certain dimensions while keeping others unchanged, creating a new design—which could then be recorded in turn. Commercial transactions could be captured, so one knew how much crop was produced from a harvest or a field (and how much would be taken away by the state in taxes). Quantification enabled prediction and thus planning, even if it was as crude as simply guessing that next year’s harvest would be as bountiful as the previous years’. It let partners in a transaction keep tabs on what they owed each other. Without measuring and recording, there could be no money, because there wouldn’t have been data to support it.
Over the centuries, measuring extended from length and weight to area, volume, and time. By the beginning of the first millennium A.D., the main features of measuring were in place in the West. But there was a significant shortcoming to the way early civilizations measured. It wasn’t optimized for calculations, even relatively simple ones. The counting system of Roman numerals was a poor fit for numerical analysis. Without a base-ten “positional” numbering system or decimals, multiplication and division of large numbers were hard even for experts, and simple addition and subtraction lacked transparency for most of the rest.
An alternative system of numerals was developed in India around the first century A.D. It traveled to Persia, where it was improved, and then was passed on to the Arabs, who greatly refined it. It is the basis of the Arabic numerals we use today. The Crusades may have brought destruction on the lands the Europeans invaded, but knowledge migrated from East to West, and perhaps the most significant transplant was Arabic numerals. Pope Sylvester II, who had studied them, advocated their use at the end of the first millennium. By the twelfth century Arabic texts describing the system were translated into Latin and spread throughout Europe. As a result, mathematics took off.
Even before Arabic numerals arrived in Europe, calculating had been improved through the use of counting boards. These were smooth trays on which tokens were placed to denote amounts. By sliding the tokens in certain areas, one could add or subtract. Yet the method had severe limitations. It was hard to calculate very large and very small numbers at the same time. Most important, the numbers on the boards were fleeting. A wrong move or a careless bump might change a digit, leading to incorrect results. Counting boards may have been tolerable for calculating, but they were bad for recording. And the only way to record and store the numbers shown on the boards was to translate them back into inefficient Roman numerals. (The Europeans were never exposed to the abacuses of the Orient—in hindsight a good thing, since the devices might have prolonged the use of Roman numerals in the West.)
Mathematics gave new meaning to data—it could now be analyzed, not just recorded and retrieved. Widespread adoption of Arabic numerals in Europe took hundreds of years, from their introduction in the twelfth century to the late sixteenth century. By that time, mathematicians boasted that they could calculate six times faster with Arabic numerals than with counting boards. What finally helped make Arabic numerals a success was the evolution of another tool of datafication: double-entry bookkeeping.
Accountants invented script in the third millennium B.C. While bookkeeping evolved over the centuries that followed, by and large it remained a system of recording a particular transaction in one place. What it failed to do was to show easily at any given time what bookkeepers and their merchant employers care about most: whether a particular account or an entire venture was profitable or not. That began to change in the fourteenth century, when accountants in Italy started recording transactions using two entries, one for credits and one for debits, so that overall the accounts are in balance. The beauty of this system was that it made it easy to see profits and losses. And suddenly dull data began to speak.
Today double-entry bookkeeping is usually considered only for its consequences for accounting and finance. But it also represents a landmark in the evolution of the use of data. It enabled information to be recorded in the form of “categories” that linked accounts. It worked by means of a set of rules about how to record data—one of the earliest examples of standardized recording of information. One accountant could look at another’s books and understand them. It was organized to make a particular type of data query—calculating profits or losses for each account—quick and straightforward. And it provided an audit trail of transactions so that the data was more easily retraceable. Technology geeks can appreciate it today: it had “error correction” built in as a design feature. If one side of the ledger looked amiss, one could check the corresponding entry.
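The mechanics behind that built-in error check can be shown in a few lines. This is a simplified sketch rather than a faithful model of Renaissance ledgers or modern accounting software: the account names, amounts, and function names are invented, and it leaves out details such as recording the cost of goods sold.

```python
from collections import defaultdict

def post(journal, description, debit_account, credit_account, amount):
    """Record one transaction as two lines: a debit on one account, a credit on another."""
    journal.append({"desc": description, "account": debit_account, "debit": amount, "credit": 0})
    journal.append({"desc": description, "account": credit_account, "debit": 0, "credit": amount})

def trial_balance(journal):
    """Return (total debits, total credits); the two should be equal if the books are sound."""
    total_debits = sum(line["debit"] for line in journal)
    total_credits = sum(line["credit"] for line in journal)
    return total_debits, total_credits

def account_balances(journal):
    """Net position of each account: debits minus credits."""
    balances = defaultdict(int)
    for line in journal:
        balances[line["account"]] += line["debit"] - line["credit"]
    return dict(balances)

# Hypothetical transactions in whole currency units.
journal = []
post(journal, "buy cloth for cash", "inventory", "cash", 100)
post(journal, "sell cloth for cash", "cash", "sales", 150)

print(trial_balance(journal))     # debits and credits agree
print(account_balances(journal))  # one quick query per account

# Simulate a copying error on one side of the ledger.
damaged = [dict(line) for line in journal]
damaged[0]["debit"] = 10
print(trial_balance(damaged))     # totals no longer agree, flagging the mistake
```

Because every amount is written twice, a slip on one side shows up as a mismatch in the totals, and the paired line carrying the same description shows what the figure should have been.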
Still, like Arabic numerals, double-entry bookkeeping was not an instant success. Two hundred years after this method had first been devised, it would take a mathematician and a merchant family to alter the history of datafication.
The mathematician was a Franciscan monk named Luca Pacioli. In 1494 he published a textbook, written for the layperson, on mathematics and its commercial application. The book was a great success and became the de facto mathematics textbook of its time. It was also the first book to use Arabic numerals throughout, and thus its popularity facilitated their adoption in Europe. Its most lasting contribution, however, was the section devoted to bookkeeping, where Pacioli neatly explained the double-entry system of accounting. Over the following decades the material on bookkeeping was separately published in six languages, and it remained the standard reference on the subject for centuries.
Big Data: A Revolution That Will Transform How We Live, Work, and Think Page 9