Info We Trust

by R J Andrews


  WILLIAM PLAYFAIR, 1801

  It is true that we are painfully limited creatures. We are unable to fully appreciate ratio comparisons. We only detect familiar patterns. Clues that are there may be too subtle for us to see. Their signal is too weak to our naked eyes. But, also remember that we are superior tool makers. Difficult cases are to be expected, and can be greeted with a smile. It is our opportunity to help the data show us more. Through a variety of methods, we can bring hidden comparisons and patterns into the light.

  CHAPTER

  9

  CREATE TO EXPLORE

  When the human realm seems doomed to heaviness, I feel the need to fly like Perseus into some other space. I am not talking about escaping into dreams or into the irrational. I mean that I feel the need to change my approach, to look at the world from a different angle, with different logic, different methods of knowing and proving.

  ITALO CALVINO, 1993

  Sometimes, the first data sketches we make sing, “Hey explorer, look over here!” But we should not always expect to see meaningful patterns at first glance. Data often holds more than first visions reveal. Data needs our help to probe its murky depths. Operations—analogous to counting, subtraction, and multiplication—can help produce more visual meat for our eyes to bite into.

  Profile

  The search for patterns can be augmented. Layered guides, like mini-maps to the data, help bring features of interest to our attention. We may be tempted to hijack any of the following devices as standalone summaries of the data, but it is far too early for that. Profiles of data, just like profiles of people, conceal essential details. We are still wearing our explorer hats and digging for treasure. No data point should be hidden. For now, each profile is an accessory guide that will help us supercharge our inspection.

  Histogram was coined by Karl Pearson in 1891 from historical diagram because it is a simplified, or diagrammatic, view of what has happened.

  Profiles show how data is distributed. They help make sense of where marks are located. We already had a first encounter with the histogram profile via Tukey's stem-and-leaf plot. Skillful design of density-summarizing histograms comes down to playful bin sizing. There is no best bin width, just different aspects of the data that can be revealed. The histogram gives a rough sense of the shape of the data by showing relative frequency. Is the data symmetrical? Is a center of mass detectable? Is the data lopsided? Are there any unexpected humps? Perhaps these are not yet precise quantitative characteristics. But, taking a visual inventory of a histogram, like you would the profile of a trail ridge before a hike, can give a good sense of what kind of journey is ahead.

  Histograms reduce information in the data… choose interval based on tolerable loss of accuracy.

  WILLIAM CLEVELAND, 1985
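  To see how bin width changes what a histogram reveals, here is a minimal sketch in Python. The `histogram` helper and the `sample` values are invented for illustration and are not from the book; the point is only that the same data shows different shapes at different widths.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each bin [left, left + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample, invented for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 11, 12, 12, 13, 24]

for width in (2, 5, 10):
    print(f"bin width {width}")
    for left, count in histogram(sample, width).items():
        print(f"  {left:>2} to {left + width:<2} {'#' * count}")
```

  Narrow bins expose the two humps and the lone straggler in this sample, while wide bins trade that detail for a simpler outline.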

  The mountainous bins of the histogram can be traded for a more numeric view of the data's center, symmetry, and spread. The box plot stakes its heart at the data's median, boxes the middle 50 percent, extends its whiskers to a chosen wider range, and shows whatever outliers remain outside. The box plot delivers focus, but its abstraction loses the shape of the histogram. Is it possible to get the best of both worlds? A violin plot mirrors the curves of the histogram and smooths its bins into an organic shape Stradivari might approve. A heatmap shades cells to indicate aggregate density.

  Modal means a local maximum, or peak. Unimodal means just one peak. The normal distribution is a special unimodal distribution. Distributions may also be bimodal or multimodal.

  Visual profiles provide a sense of the distribution of our data: where it is, where it is not, and where we might need to focus attention. Profiles augment our ability to see the data, but only in one dimension at a time. To probe in 2-D requires something more.

  The hinge diagram takes only the key numbers from the box plot: the minimum, median, maximum, and hinge data points just inside the 25th and 75th percentiles that define the middle box. Together these five points create an efficient summary of the distribution. For some applications, and with practice, perhaps the tiny is all you need.
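  The five numbers of the hinge diagram can be computed in a few lines. This sketch assumes one common definition of the hinges (the medians of the lower and upper halves of the sorted data, with the overall median shared by both halves when the count is odd); other definitions give slightly different values. The function name and sample values are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower hinge, median, upper hinge, maximum)."""
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # both halves include the median when n is odd
    lower, upper = data[:half], data[n - half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical sample, invented for illustration only.
print(five_number_summary([9, 1, 7, 3, 5, 2, 8, 4, 6]))
```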

  Translate

  Recall that alphabetic order was developed by scholars to catalog work for easier retrieval. Today, we miss out on what an information-based order may reveal if we resign to being governed by first letters, or any other default. Groups, voids, and other patterns may appear once we begin moving categories about in search of better forms. Jacques Bertin showed how reordering “simplifies the images without diminishing the number of observed correspondences.” We are free to reorder alphabetic category labels any way we please. But, what can be done with numeric data that has a built-in ordered and connective structure?

  Reorder the grid categories to reveal:

  We previously considered five different ways to show time using different 2-D plots of the same data. Here is a portion of that same data, with a single, straight summary line that fits the entire scatter of marks. It is a fine way to summarize the overall trend, but the fit does not have much to offer regarding what happens along the way.

  Linear regression minimizes the total vertical distance between the data points and the best-fit line, most often by minimizing the sum of squared distances, which emphasizes the impact of data points far from the fit.
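  A small sketch of the least-squares version of this idea, written in plain Python. The function name and the sample points are assumptions made for illustration. The second fit shows how a single distant point pulls the line, because its distance is squared.

```python
def least_squares_line(xs, ys):
    """Slope and intercept that minimize the sum of squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical points, invented for illustration only.
xs = [0, 1, 2, 3, 4]
tidy = [1, 3, 5, 7, 9]           # lies exactly on a line
with_outlier = [1, 3, 5, 7, 19]  # same data, last point moved far away

print(least_squares_line(xs, tidy))
print(least_squares_line(xs, with_outlier))
```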

  If we truly want to uncover the comparisons that data offers, then it might help to focus our vision explicitly on those differences. Consider a basketball scoreboard. Whether it reads 109–107, or 72–70, the accumulated points total is not interesting in the last moments of the game. It is the two-point spread that makes the final possession by the trailing team exciting. The residual is the difference between a data point and a summary value. It helps elevate variation to our eye. Here, the same data is chained and its median is highlighted. Then, the median is subtracted from each data value to find the residual at each point. The new scale focuses our attention on variation.

  DATA − MEDIAN = RESIDUAL
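  The subtraction is simple enough to sketch directly. The readings below are hypothetical values chosen for illustration; only the operation (each value minus the median) comes from the text.

```python
from statistics import median

# Hypothetical readings, invented for illustration only.
readings = [72, 70, 75, 71, 69, 74, 73]

center = median(readings)
residuals = [value - center for value in readings]

print("median:", center)
print("residuals:", residuals)
```

  The residuals sit on a scale centered at zero, so the eye compares variation rather than totals.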

  An alternative to the straight line fit is the smooth. The smooth threads a single curve through data according to a calculated running average. The smooth can be augmented too. The distance between each data point and the smooth is a particular type of residual Tukey called the rough. Whatever the summary line, consider separating it out in order to take a more focused look at the variation about it.

  Subtracting summaries from data to see the residual is just the beginning of how simple arithmetic can help us explore. Many data behaviors vary with regular time intervals. Some change naturally, like temperature across the year. Other behaviors, like weekday rush hours, vary depending on culture. Distinguishing regular cycles and long-term trends can help us understand if a small-scale fluctuation is expected or worth more investigation. Noisy data can be decomposed to its seasonality to show influencing cycles and spotlight unaccounted-for variation: Data equals trend plus season plus residual. It is all done using the same additive spirit of showing the residual.
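  The smooth and the rough can be sketched with a centered running average. This is an assumption about the smoother: the text says only "running average," and Tukey's own smoothers often used running medians instead. The function name, window size, and data are hypothetical.

```python
def running_average(values, window=3):
    """Centered running mean; the window shrinks at the two ends."""
    half = window // 2
    smooth = []
    for i in range(len(values)):
        neighbors = values[max(0, i - half): i + half + 1]
        smooth.append(sum(neighbors) / len(neighbors))
    return smooth


# Hypothetical series, invented for illustration only.
data = [2, 4, 9, 4, 5, 6, 1, 8]

smooth = running_average(data)
rough = [d - s for d, s in zip(data, smooth)]  # data = smooth + rough

print("smooth:", smooth)
print("rough: ", rough)
```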

  Not everything is a straight line.

  JOHN TUKEY, 1977

  CO2 readings, in parts per million, across the last five years at Mauna Loa, Hawaii.

  The overall trend is separated by finding a linear fit to the data.

  Notice how the following vertical scales change. Seasonality, the pattern that is the same every year after accounting for the trend, is determined by averaging monthly readings.

  The residual is what is left from the data after subtracting the trend and seasonality.
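  The three steps in these captions (linear trend, averaged monthly seasonality, leftover residual) can be sketched as follows. The series is a synthetic stand-in and not real Mauna Loa data, and the function names are assumptions for illustration. Statistical libraries offer more careful decompositions; this only shows the additive arithmetic.

```python
import math
from statistics import mean


def linear_fit(xs, ys):
    """Least-squares slope and intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def decompose(values, period=12):
    """Split a series so that values = trend + season + residual."""
    xs = list(range(len(values)))
    slope, intercept = linear_fit(xs, values)
    trend = [intercept + slope * x for x in xs]
    detrended = [v - t for v, t in zip(values, trend)]
    # Average every reading that shares a position in the cycle (all Januaries, etc.).
    cycle = [mean(detrended[p::period]) for p in range(period)]
    season = [cycle[x % period] for x in xs]
    residual = [v - t - s for v, t, s in zip(values, trend, season)]
    return trend, season, residual


# Synthetic stand-in for five years of monthly readings, NOT real measurements.
series = [400 + 0.2 * m + 3 * math.sin(2 * math.pi * m / 12) for m in range(60)]
trend, season, residual = decompose(series)
```

  Plotting `trend`, `season`, and `residual` on separate vertical scales mirrors the sequence of panels described in the captions.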

  What if instead of translating the data, we move the entire plane? The connected scatter plot is one of my favorite charts. It reserves both axes for performant variables by connecting marks through time. Unfortunately, this places one of the two performant variables on the horizontal axis, where it is easily mistaken for time. We could skip the hassle of correcting this confusion and remove the x-axis entirely. An unconventional design rotates the entire graphic so time goes right.

  A changed approach is precisely the goal for the journey ahead: to discover new ways of seeing, to open spaces for possibilities, and to find “fresh methods” for animating and awakening.

  NICK SOUSANIS, 2015

  Reordering and accentuating differences with translation helps us notice more with our eyes. A single image can tell a lot, but it cannot tell all. Just as a telescope cannot help you diagnose a fractured bone, each view of the data can only reveal a slice of reality. Data exploration is an iterative and playful dialog between your curiosity and your data. Playtime is about to get fun.

  Transform

  Translations, which use addition and subtraction to focus our eye, can only take us so far. Many of the comparisons our data have to offer are of a multiplicative nature. Ratios, like the rocket thrust comparison, describe how many times bigger A is than B. Reshuffling categories, removing medians, or rotating planes do not help make more sense of these abstract comparisons.

  The entire problem is one of augmenting this natural intelligence…

  JACQUES BERTIN, 1967

  Most of modern statistics is built around data that are normally distributed—if not normal, at least symmetric. But data don't always arrive on our doorstep like that.

  HOWARD WAINER, 2017

  In Greek mythology, Procrustes was a rogue who mutilated travelers. He stretched them, or cut off their legs, until they perfectly fit the dimensions of his iron bed. Procrustean sadism is a fitting metaphor for transformations that torture insights out of data. You see, the number lines that help us illustrate our data can likewise be stretched and compressed in pursuit of more clues.

  Scales define data worlds. They are very often number lines along horizontal and vertical axes. Recall how these perpendicular rulers mimic our own daily experience. We walk about the surface of the earth inside our own personal compass roses. We cannot change the laws of our world, but we can control the rules that govern the land of our data. It is time to stretch beyond the default scales data arrive with, and start playing with the shape of data worlds.

  In the Russian formalist study of narrative, the fabula is the chronological raw material of a story and the syuzhet is the way the story is organized.

  Think of data as the content and the number lines we use to plot data as the form. The content-form pair is analogous to water and the vessel that holds it. The same content looks different in different containers. The next technique shows you how to manipulate form to create a better view of the data.

  If the way the numbers were gathered… does not make them easy to grasp [then] we should change them.

  JOHN TUKEY, 1977

  Suppose you position a few data points across the number line and notice that an outlier causes some of the marks to bunch up, or overplot. Sketched here are the five most common elements in the human body, by weight in kilograms. A person contains about 43 kilograms of oxygen (O). This outlier causes two of the less abundant elements, nitrogen (N) and calcium (Ca), to overlap. The gap between the top outlier, O, and the bottom cluster is so large it could be expressed as a ratio. This difference will persist as long as we retain the native scale. This axis is our number line; it obeys our command. We can change the scale if we apply the same mathematical operation across all of the data. Since we have a top-tail outlier, we need a transformation that shrinks large numbers more than it shrinks small ones. Let us reposition the marks by their square root (√x). Now, oxygen is still separated, but not at the visual expense of its peers.

  See calcium (Ca) and nitrogen (N) overlap at the lower end.

  A second scale highlights the perfect squares of the original abundance to help make the visual connection to the square root transformation, below.

  Element marks are repositioned according to their square roots, removing the overlap between Ca and N.
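  The effect of the square root on the crowded marks can be checked numerically. Only the 43 kg of oxygen comes from the text; the other four weights are approximate, commonly cited figures for a 70 kg person and are assumptions here. The `axis_positions` helper is hypothetical and rescales values to positions between 0 and 1 along an axis.

```python
import math

# Oxygen is from the text; the other weights (kg) are approximate assumptions.
elements = {"O": 43.0, "C": 16.0, "H": 7.0, "N": 1.8, "Ca": 1.0}


def axis_positions(values):
    """Rescale values to 0..1 positions along an axis of fixed length."""
    lo, hi = min(values.values()), max(values.values())
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}


raw = axis_positions(elements)
rooted = axis_positions({name: math.sqrt(v) for name, v in elements.items()})

# Share of the axis length separating nitrogen from calcium, before and after.
gap_raw = raw["N"] - raw["Ca"]
gap_root = rooted["N"] - rooted["Ca"]
print(f"N to Ca gap: {gap_raw:.3f} of the axis raw, {gap_root:.3f} after square root")
```

  Under these assumed weights the gap between N and Ca grows to roughly three times its original share of the axis, while the order of the elements is unchanged.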

  The square root is a good transformation for this small dataset. But maybe a different warping effect would be better? Why, we could raise our data to a power (x², x³, …) or take its inverse (1/x), or try some combination (like 1/√x). You see, once we entertain the possibility of transforming our scales, the effort threatens to spiral into a multiverse of possibilities. How can we rein in this power before we lose control?

  It is an unusual data set indeed that yields its secrets more readily when it is left untransformed.

  HOWARD WAINER, 2005

  Overlapping marks, empty voids, and ratio comparisons can all hinder our ability to see the data before us. However ugly the data arrives, the ladder of transformation can help find better forms for visual exploration. If your data (y) has outliers on its lower end, a bottom-tail, then try walking the data up the ladder: y → y² → y³. This sequence of powers impacts larger numbers more and helps pick apart clusters at the top end of your distribution. If the data has a top-tail, outliers like the 43 kg of oxygen, try walking the data down the ladder. The Procrustean ladder lets us warp the number line world in pursuit of seeing what the data is hiding.
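  One way to walk the ladder is to compute a skewness number at each rung and watch it move toward zero. The sketch below is an illustration under assumptions: the rungs chosen, the use of moment skewness as the yardstick, and the sample weights (only oxygen's 43 kg is from the text) are all choices made here. The reciprocal rung is negated so that the order of the data is preserved.

```python
import math


def skewness(values):
    """Moment skewness: positive for a top-tail, negative for a bottom-tail."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return sum((v - m) ** 3 for v in values) / n / sd ** 3


# Rungs from the top of the ladder to the bottom (an assumed selection).
ladder = {
    "y^2": lambda y: y ** 2,
    "y": lambda y: y,
    "sqrt(y)": math.sqrt,
    "log(y)": math.log,
    "-1/y": lambda y: -1 / y,
}

# Oxygen is from the text; the other weights (kg) are approximate assumptions.
weights = [43.0, 16.0, 7.0, 1.8, 1.0]

skews = {name: skewness([f(y) for y in weights]) for name, f in ladder.items()}
for name, value in skews.items():
    print(f"{name:>8}: skewness {value:+.2f}")
```

  For this top-tailed sample the skewness shrinks at each step down the ladder, is closest to zero at the log rung, and flips sign at the reciprocal, a hint that the walk has gone one rung too far.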

  If we want to learn more we must think more.

  JOHN TUKEY, 1977

  If one of our biggest problems as humans is grappling with ratio comparisons, then what we need is a system of “ratio-numbers.” This system would express ratios as differences, the kind of comparison we like. That way, we would be able to consume ratios in a more visually discernible way. Lucky for us, this “ratio-number” already exists. In fact, it has been helping us for over 400 years.

  Left-skewed “negative” bottom-tails can be moved up the ladder

  Right-skewed “positive” top-tails can be moved down.

  The ratio-number, or logarithm, has a tricky technical definition. How it works can prevent us from appreciating why it works: Logs were invented to transform ratios to differences. Too often, we throw a logarithm at a scale because we know it will compress a wide range of numbers to a narrower field. This may motivate you to try a log transformation, but it is only a partial victory if you miss the magic of what happens. For that, let's plot a short series of doubling numbers: 8, 16, 32, 64. Each step rises by 100 percent, a constant increase in ratio from one pair of numbers to the next. Above, the visual distance between these numbers is emphasized with filled boxes. Along the horizontal, the distance between each number doubles, because the number doubles. Along the vertical, we plot the logarithm of each number. The vertical visual difference is the same, because the ratio is the same. The world of the logarithm is built for comparing ratios. This is just what is needed to address troublesome comparisons that are so hard for our mind to make any sense of.

  Logarithm is from the Greek words logos (ratio) and arithmos (number, also the root for arithmetic: the art of counting). John Napier published a set of logarithms in 1614 and they evolved to Leonhard Euler's 1748 standard definition.

  The log's quotient property states directly that the ratio of division becomes the difference of subtraction: log(x/y) = log(x) – log(y)
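  The doubling series makes the quotient property easy to verify. This short sketch uses Python's standard `math` module; the variable names are arbitrary.

```python
import math

doubling = [8, 16, 32, 64]
pairs = list(zip(doubling, doubling[1:]))

# On the native scale each step is twice as long as the one before it.
raw_steps = [b - a for a, b in pairs]

# On a log scale every doubling is the same length.
log2_steps = [math.log2(b) - math.log2(a) for a, b in pairs]
log10_steps = [math.log10(b) - math.log10(a) for a, b in pairs]

print("raw steps:  ", raw_steps)
print("log2 steps: ", log2_steps)
print("log10 steps:", [round(s, 4) for s in log10_steps])

# Quotient property: log(x / y) = log(x) - log(y)
print(math.isclose(math.log(64 / 8), math.log(64) - math.log(8)))
```

  Changing the base only rescales the step length (1 in base 2, about 0.301 in base 10); the steps remain equal to one another.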

  The example transformation below expands the original set of elements of the human body to several dozen. Now, elements that have only a trace abundance, like lithium and arsenic, are included too. Each element is represented by a dot, which is enough detail to show a ladder full of transformations. The most abundant element, oxygen, is again in the rightmost position. It is kept fixed there at every rung of the ladder. The 37th least abundant element (molybdenum) is kept fixed at the leftmost position. See how oxygen causes most of the elements to overlap on the blue w-rung (I chose w to represent weight). Transformations reshape how the distribution appears. Different views reveal different aspects of the distribution. It appears the log(w) transform de-clusters best.

  The whole canvas often gets consumed to show a narrow diagonal band of data. Empty corners are wasted real estate once we account for the trend. Data translation, transformation, and logarithms can combine to serve our comparison-seeking eyes. Here we compare the abundance (percent by weight) of the seven most common chemical elements in people and plants. The diagonal represents an equal relationship between people and plants. We can trace our eye down the diagonal to see that we share with plants the same sequence of most abundant elements: oxygen (O), carbon (C), hydrogen (H), and so on. Notice how the even 1:1 diagonal relationship leaves a lot of the canvas rectangle empty.

  The same shape persists no matter what logarithm base (10, e, or 2) is used.

  Augment human capabilities rather than replace people.

  TAMARA MUNZNER, 2014

  These differences can be visually exaggerated if we rotate and expand the data so that it fills all the available space. This is done with Tukey's sum-difference graph. By re-plotting the elements on an adjusted frame of reference, we can accentuate visual comparisons across the data. This gives our eyes more meat to dig into, affording a better shot at seeing what is going on. The log is one special rung on the ladder of transformation. We plot data because we want to explore it with our eyes, and the ladder makes that possible. The picture changes but the data remain the same.
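  The text names the sum-difference graph without giving its arithmetic. A minimal sketch, assuming the common form in which each pair (x, y) is re-plotted as (x + y, y − x) after taking logs, so that the 1:1 diagonal becomes a horizontal line at zero. The labels and values are hypothetical placeholders, not the book's element data.

```python
import math


def sum_difference(x, y):
    """Re-plot a pair of positive values as (log sum, log difference)."""
    lx, ly = math.log10(x), math.log10(y)
    return lx + ly, ly - lx  # the difference equals log10(y / x)


# Hypothetical (x, y) abundance pairs, invented for illustration only.
pairs = {"A": (60.0, 45.0), "B": (20.0, 40.0), "C": (10.0, 5.0), "D": (1.0, 1.0)}

for label, (x, y) in pairs.items():
    total, diff = sum_difference(x, y)
    print(f"{label}: sum {total:+.2f}, difference {diff:+.2f}")
```

  Points that were crowded along the diagonal now spread vertically by their ratio: zero means equal, positive means y is larger, negative means x is larger.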

 
