
Analog Science Fiction and Fact - July-August 2014


by Penny Publications


  He flipped off the com.

  He stared into the blackness.

  For a long time, he was alone.

  * * *

  Spanking Bad Data Won't Make Them Behave

  Science Fact | Michael F. Flynn | 4039 words

  "A Fact has no 'why.' There it stands, self-demonstrating."—Robert A. Heinlein, "The Year of the Jackpot"

  Facts are elusive critters. Far from being self-demonstrating, they are meaningless without context. "Theory determines what can be observed," Einstein once remarked to Heisenberg. We cannot accumulate answers without first asking a question. Pierre Duhem put it this way:

  "Take two physicists who do not define pressure in the same manner because they do not admit the same theories of mechanics. One for example accepts the ideas of Lagrange;the other adopts the ideas of Laplace and Poisson. Submit to these two physicists a law whose statement brings into play the notion of pressure. They will hear the statement in two different ways. To compare it with reality, they will make different calculations so that one will find this law verified by facts which, for the other, will contradict it." [Emph.added] 1

  —"Some Reflections on the Subject of Experimental Physics" (1894)

  So much for the notion that facts alone can settle questions. One scientist measures an iridium-rich layer and sees the footprint of a comet impact; another sees the eruption of the Deccan Traps. Xenophanes saw marine fossils in the mountains of Ancient Greece and concluded that a primordial flood had once covered the world because he knew of no other natural process that could deposit marine creatures on mountaintops.

  Let's take the speed of light:

  Different results were obtained using different methods, or even by the same researcher using the same method! In measurement system analysis, variability among repeated determinations by the same individual is called "repeatability." 3 Collectively, the trend was toward slower light speeds. 4

  Figure 1: Trend in measured light speed. The fitted line is a quadratic regression, just for slaps and giggles.

  That light is actually slowing down is appealing to SF folks, but unlikely for a number of reasons. 5 Perhaps measurement methods have grown more accurate; but another possibility may not occur to anyone.

  There ain't no such thing as the speed of light.

  When measurement goes bad

  Has Flynn lost his flipping mind? Let's consider an example:

  The Allegory of Can Volume. Fill volumes of aluminum beverage cans are usually measured by weighing them before and after filling them with distilled water. The weight difference, given water temperature, converts to volume. A packer complained that delivered cans had excessive volume, which would allow carbonated beverages to gas out in the can, yet the can-maker's data were on target. The discrepancy was because the packer had taken the vacuum pump used to measure glass bottle volume and sucked the air out of the aluminum cans. But aluminum cans do something when you pull a vacuum on them that glass bottles don't. Patient Reader is invited to speculate.
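
  The weight-to-volume step is simple arithmetic once the density of water at the measured temperature is known. A minimal sketch, with an assumed density approximation and invented weights:

```python
# A minimal sketch of the gravimetric fill-volume calculation described above.
# The density polynomial is an assumed rough approximation for distilled water
# near room temperature, and the weights are invented for illustration.
def water_density_g_per_ml(temp_c: float) -> float:
    # Rough approximation (assumption), adequate near 15-25 C for a sketch.
    return 0.99984 + 5.3e-5 * temp_c - 7.0e-6 * temp_c ** 2

def fill_volume_ml(empty_g: float, filled_g: float, temp_c: float) -> float:
    """Convert the before/after weight difference of a water-filled can to volume."""
    return (filled_g - empty_g) / water_density_g_per_ml(temp_c)

print(f"{fill_volume_ml(empty_g=13.5, filled_g=368.9, temp_c=22.0):.1f} mL")
```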

  Okay, the Vacuum Pump Method was clearly inappropriate for cans, but similar issues arise even between legitimate methods. The Inclined Plane and Horizontal Plane Methods for measuring the mobility of packaging, both ASTM-approved, 6 often produced different coefficients of friction on the self-same package. On a more celestial level: the Hubble mirror had the correct curvature when measured one way, but was out of focus when measured by an older method. Decision-makers went with the newer method—and the mirror had to get corrective lenses in a dramatic space-optometry mission.

  Richard Feynman warned that results obtained on one apparatus might not hold for another:

  "I was shocked to hear of an experiment done at the big accelerator at the National Accelerator Laboratory, where a person used deuterium. In order to compare his heavy hydrogen results to what might happen with light hydrogen, he had to use data from someone else's experiment on light hydrogen, which was done on different apparatus.

  When asked why, he said it was because he couldn't get time... to do the experiment with light hydrogen on this apparatus because there wouldn't be any new result." 7 [Emph. added]

  Operational definitions. Measurements are defined by the operations performed to obtain them. Change the operations—the instrument, method, operator, perhaps the ambient conditions—and the measurements change. The very existence of a fact is thus a human accomplishment. 8 This is a problem not only in the hard sciences, but even in the soft ones. 9

  The Allegory of Infant Mortality. In 1998, Switzerland reported 4.8 infant deaths per 1,000 live births while the United States reported a shocking 7.2. But in Switzerland, infants shorter than 30 cm at birth are classified as stillborn, not as live births. Roughly one-third of U.S. infant deaths fall into the equivalent range. Had they been omitted, the U.S. rate would have been roughly 7.2 × 2/3 ≈ 4.8.

  The apparent difference in infant mortality was largely an artifact of the measurement method!

  Comparing statistics gathered at different times or places can thus be hazardous to your scientific health. You can't blithely assume the measurements were made the same way. 10

  "It's not that the data are meaningless; but that they might not mean what we think they mean."—John Lukacs, The Passing of the Modern Age

  Surrogate numberhood. When X is difficult or expensive to measure, we often measure a surrogate Y and use it to estimate X:

  • Rockwell hardness instead of tensile strength

  • Radiation backscatter instead of coal density

  • Tree-ring width instead of temperature

  • Redshift instead of galactic distance

  • Blood flow in the brain instead of a "moment of decision"

  Even thermometers use the height of mercury in a glass tube as a surrogate for actual temperature. Surrogacy requires a sound scientific relation, not merely a statistical correlation, as well as control of other factors that also influence Y. Tree rings are affected by rainfall and even by animal scat deposited on the roots. The relationship of backscatter to coal density differs from seam to seam and so must be calibrated for each shipment received by the power plant.

  Figure 2: Calibration curve of (measured) Backscatter to (desired) Density from eleven coal samples packed into calibration tubes to specified densities. The outer lines show uncertainty in the prediction. If backscatter for a bunker measures 6000, the predicted coal density is ~ 67±3. Keep that ±3—the error term—in mind.

  The accuracy and precision of the surrogate propagate backward to the desired value following mathematical laws, as illustrated in the figure. Never use the confidence interval to do this. The confidence interval is a bound on the parameter—i.e., on the slope—not on the skill of the model at predicting actual data. 11
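
  A minimal sketch of the distinction, using invented calibration data (not the eleven tubes of Figure 2), shows how the prediction interval for a new bunker reading is wider than the confidence interval on the fitted line:

```python
# Hypothetical backscatter/density calibration data, for illustration only.
import numpy as np
from scipy import stats

backscatter = np.array([4800, 5100, 5300, 5600, 5800, 6000, 6200, 6400, 6700, 6900, 7100.])
density     = np.array([  57,   62,   61,   65,   65,   68,   68,   71,   71,   75,   74.])

n = len(backscatter)
slope, intercept = np.polyfit(backscatter, density, 1)
resid = density - (intercept + slope * backscatter)
s = np.sqrt(np.sum(resid**2) / (n - 2))            # residual standard error
sxx = np.sum((backscatter - backscatter.mean())**2)

x0 = 6000.0                                        # new bunker reading
y0 = intercept + slope * x0
t = stats.t.ppf(0.975, n - 2)

# Confidence interval: uncertainty in the *mean* response (the fitted line) at x0.
ci = t * s * np.sqrt(1/n + (x0 - backscatter.mean())**2 / sxx)
# Prediction interval: uncertainty in an *individual* new density measured at x0.
pi = t * s * np.sqrt(1 + 1/n + (x0 - backscatter.mean())**2 / sxx)

print(f"predicted density at {x0:.0f}: {y0:.1f}")
print(f"95% CI for the mean: +/-{ci:.2f}    95% PI for a new value: +/-{pi:.2f}")
```

  The ±3 shown in Figure 2 plays the role of the prediction interval, not the confidence interval.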

  When samples go bad

  "A hundred faulty measurements are no better than ten faulty measurements."

  —Michael F. Flynn, "Garbage Out"

  A measurement is itself a sample. For example, there are infinitely many diameters along a shaft and around it. How many should we measure? At which locations? Will we report the mean of the replicates? The median? The extreme values? The truncated mean? 12

  Geographic sampling is important in mining, water management, climatology, et al. But taking surface temperatures, for example, is like measuring shaft diameters. How many, and where? What if data are missing? Can missing measurements be estimated in some way? 13

  A sample should be representative of the population. Two particular hazards are judgment samples (cherry-picking) and convenience samples. For example, in sampling from pallets fed by three machines, shown in the bird's eye view below, only units around the outer edges of the pallet were accessible to the technician. Hence, Machine 2, which produced 33% of the target population, made up only 19% of the sample. This convenience sample could seriously bias the estimates.

  Figure 3: Bird's-eye view of one layer of a pallet being filled by three machines.

  A sample represents only the population from which it was drawn. A bank once estimated annual accounting errors by verifying all transactions for the month of July and multiplying by 12. It is hard to see how this could ever be right: the sample represented only the month of July, not the entire year. Fatigue, nonresponse, unclear definitions, incomplete sampling frames, etc. produce non-sampling variation, which cannot be corrected by adjusting sample sizes. I have seen technicians making measurements on autopilot, and even a few cases where the measurements were recorded without the ugly necessity of actually making them.

  The cure for judgment sampling is random selection, and for convenience sampling, stratification. 14
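
  As a minimal sketch (with assumed machine shares and an arbitrary sample size; the article gives only Machine 2's 33%), proportional stratified random sampling draws from each machine in proportion to its share of the pallet, so no machine is under-represented by accident of accessibility:

```python
import random

random.seed(42)
# Assumed machine shares of the pallet population, for illustration.
strata = {
    "machine_1": [f"m1_{i}" for i in range(400)],   # 40% of the population
    "machine_2": [f"m2_{i}" for i in range(330)],   # 33% of the population
    "machine_3": [f"m3_{i}" for i in range(270)],   # 27% of the population
}
total = sum(len(units) for units in strata.values())
sample_size = 60

for machine, units in strata.items():
    # Allocate the sample in proportion to each stratum's share, then draw that
    # many units at random *within* the stratum (no reaching only for the edges).
    k = round(sample_size * len(units) / total)
    picked = random.sample(units, k)
    print(f"{machine}: {len(picked)}/{sample_size} sampled "
          f"({100 * len(picked) / sample_size:.0f}% vs "
          f"{100 * len(units) / total:.0f}% of the population)")
```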

  Two species of variation: A short but necessary digression.

  A process is "a set of causes that work together to produce a result." 15 Since these causes are not constant, the results will vary. A process is called stable ("random") if the long-term variation is about the same as the short-term variation. There is no particular cause for a random fluctuation. It is built into the design of the process. If a pair of dice shows 12, the outcome is due to a chance combination of many causes, not any one particular cause.

  But if a process is unstable (long-term variation greater than short-term), the excess variation can be assigned to a particular cause. If that pair of dice shows 13, something funny is going on. Assignable variation is typically due to breakdowns of the plan and outside forces.

  In particular, there is no "measurement" unless the measurement process is stable.
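
  A minimal sketch of that stability check, on simulated data, compares a short-term sigma estimated from successive differences with the ordinary long-term standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
stable   = rng.normal(50, 2, 100)                    # one constant system of causes
unstable = np.concatenate([rng.normal(50, 2, 50),    # same short-term spread...
                           rng.normal(56, 2, 50)])   # ...but the level jumps

def short_vs_long(x):
    # d2 = 1.128 is the standard constant for moving ranges of two consecutive points.
    short_term = np.mean(np.abs(np.diff(x))) / 1.128
    long_term = np.std(x, ddof=1)
    return short_term, long_term

for name, series in [("stable", stable), ("unstable", unstable)]:
    st, lt = short_vs_long(series)
    print(f"{name:8s} short-term sigma ~ {st:.2f}   long-term sigma ~ {lt:.2f}")
# For the stable series the two estimates roughly agree; for the unstable one the
# long-term sigma is inflated by the assignable cause (the jump in level).
```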

  Figure 4: An unstable process. External variation from time to time exceeds internal variation at particular times. Bell curves of short-term variation are used for illustration only.

  When models go bad

  "All models are wrong, but some are useful."

  —George E. P. Box, "Robustness in the Strategy of Scientific Model Building"

  Unless shrouded by an inert gas, molten steel sucks nitrogen from the air. Since shrouding is always imperfect, N2 increases each time steel is recycled, making it progressively more brittle. If a certain grade exceeds 90 ppm N2, it becomes too brittle for its specified application. So what is the probability of a heat exceeding 90 ppm?

  Surprise! There is no such probability! There is only a probability given some model. The histogram below shows that for 153 consecutive heats, the Normal distribution is not a bad model for nitrogen content. The model predicts that 25% of heats will exceed 90 ppm N2 versus 22% in the actual sample. However, while the fish stinks from the head down, models begin to smell in their tails, i.e., at the extreme values. A normal distribution runs to ±∞ and no physical process will ever vary quite that much. The model predicts 0.98% of heats will exceed 104 ppm, but 2.61% actually did so. The more extreme the values, the more the model goes awry. So, rare event probabilities are usually wrong, whether of brittle steel or giant meteor impacts. "For practical purposes," the Normal distribution ends at ±3σ and any values beyond that are taken as signals of data outside the model. The model is an approximation of the reality, not vice versa. 16
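
  A minimal sketch of that tail comparison: the mean and standard deviation below are assumptions chosen only so that the fitted Normal reproduces the 25%-over-90-ppm figure quoted above; the 153 raw heats are not reproduced here.

```python
from math import erf, sqrt

# Assumed fit, chosen only so the model gives ~25% above 90 ppm as in the text.
mu, sigma = 84.3, 8.4

def normal_tail(x: float) -> float:
    """P(X > x) under the Normal(mu, sigma) model."""
    return 0.5 * (1 - erf((x - mu) / (sigma * sqrt(2))))

for limit, observed in [(90, 0.22), (104, 0.0261)]:
    print(f"> {limit} ppm N2: model {normal_tail(limit):.2%}  vs  observed {observed:.2%}")
# Near the middle the model is serviceable; out in the tail it misses badly --
# the "models begin to smell in their tails" point.
```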

  Figure 5: The Ugly World of Reality™ (left) versus the elegant and pure Platonic Form (right).

  But "normality is not normal." Non-normal processes abound; e.g., the number of calls per day received on a maintenance hotline was distributed nearly as Poisson:

  Figure 6: The number of calls received each day for a sample of 100 days. A square root transformation might normalize such distributions.
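
  A minimal sketch, using simulated Poisson counts rather than the hotline's actual 100 days, shows why the square-root transformation helps: it makes the variance roughly constant instead of tied to the mean.

```python
import numpy as np

rng = np.random.default_rng(7)
calls = rng.poisson(lam=4.0, size=100)        # simulated counts per day
roots = np.sqrt(calls)                        # square-root transformation

print(f"raw counts:  mean {calls.mean():.2f}  variance {calls.var(ddof=1):.2f}")
print(f"sqrt scale:  mean {roots.mean():.2f}  variance {roots.var(ddof=1):.2f}")
# For a Poisson variable the variance equals the mean, so the spread changes with
# the level; on the square-root scale the variance is roughly constant, which
# makes Normal-based charts and tests behave better.
```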

  And sometimes there is no distribution. This won't stop anyone taught cookbook statistics from diving headfirst into the Normal Tables and hitting his head on a rock.

  Figure 7: Normalize this, monkey boy! These emulsion cycle times had no distribution at all. (Quality engineers will suspect that there were four of something, and one was different. There were, and it was.)

  When are data not data?

  In this plot of pharmaceutical fill weights, the penultimate point doesn't seem to "belong."

  Figure 8: Time series of fill weights for Port 5. Something odd happened at 6:00 A.M., and you don't need a degree in statistics or a Tiny p-Value™ to discover that.

  Spikes (or icicles) often indicate measurement problems. The "75" recorded at 6:00 A.M. was a keystroke error for "57." But sometimes outliers are important signals. In a graph of global temperature "anomalies," spikes indicated El Niños. Never remove outliers from your data unless you know how they got there in the first place. Algorithms that automatically adjust "outliers" are not a good idea.
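
  A minimal sketch of flagging, rather than silently deleting, such a spike (the fill weights are invented; only the transposed 75-for-57 comes from the text):

```python
# Fill weights in grams; all values are invented except the transposed 75-for-57.
fill_weights = [57.2, 56.8, 57.0, 57.4, 56.9, 57.1, 75.0, 57.0]

def rough_median(values):
    # Simple middle-of-the-sorted-list median, good enough for a sketch.
    return sorted(values)[len(values) // 2]

center = rough_median(fill_weights)
mad = rough_median([abs(x - center) for x in fill_weights])   # median absolute deviation

for hour, x in enumerate(fill_weights):
    # Flag anything more than ~5 robust sigmas from the median for investigation,
    # rather than adjusting or deleting it automatically.
    if mad > 0 and abs(x - center) / (1.4826 * mad) > 5:
        print(f"point {hour}: {x} flagged -- keystroke error, or a real signal?")
```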

  When averages go bad

  An old statistician's joke runs, "If you stick your head in the oven and your feet in the freezer, on the average you're comfortable." Which is why most statisticians have not quit their day jobs and headed off for the comedy clubs. An average is a measure of central tendency, and sometimes there just isn't one.

  The Allegory of the pH Working Standard. A liquid standard, certified as 3.3 pH ± 0.1, was kept in a vault and used to calibrate the lab's pH meter each morning. These calibration checks were later plotted over a sixty-day period:

  Figure 9: Time series of daily pH calibration check. The lab had a basic problem. 17

  The median was indeed 3.3, and nearly all individual determinations fell between 3.2 and 3.4, but the measurement series is not stable. In effect, there is no measurement of pH and there is no central tendency.

  Footprints in the data tell us what kind of cause made it. A trend signals a cause acting steadily in one direction over a period of time, such as "wear" or "accumulation." In the pH case, the icicles indicated lab errors and the trend was due to improper cleaning of the probe. Production material transferred on the dirty probe was gradually poisoning the standard, making it more alkaline over time.

  Alternate History Statistics. Quality engineers track footprints to discover causes. But suppose (for some reason) you wanted to know how the standard might have performed had it not been poisoned. Simple. Let us regress. [Perform mathemagics here.] Predicted pH is Ŷ = 3.239 + 0.0017X, where X is Day. Now for each day, take (Y − Ŷ), the deviation of the measured value from the value predicted by the regression model. Such deviations are called residuals. A residuals plot shows the variation in the process with the trend removed, i.e., the variation unaccounted for by the regression.
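
  A minimal sketch of the residual calculation, using the fitted line quoted above and invented stand-ins for the sixty daily readings:

```python
import numpy as np

rng = np.random.default_rng(3)
day = np.arange(1, 61)
# Invented daily readings: the poisoning trend plus ordinary measurement noise.
ph = 3.239 + 0.0017 * day + rng.normal(0, 0.02, size=day.size)

predicted = 3.239 + 0.0017 * day      # Y-hat from the regression quoted above
residuals = ph - predicted            # what remains once the trend is removed

print(f"residual mean {residuals.mean():+.4f}, residual std {residuals.std(ddof=1):.4f}")
# Plotting `residuals` against `day` gives the counterfactual chart of Figure 10:
# the drift from the poisoned standard is gone, leaving only day-to-day variation.
```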

  Figure 10: A counterfactual chart in which we pretend that the trend did not happen. (The spikes have also been omitted.)

  "Before the observations can be adjusted, they must arise from a random operation."

  —W. Edwards Deming, Statistical Adjustment of Data

  When processes go bad

  The Allegory of the Battery Grids. A pasting machine applies a lead oxide paste onto a lead grid to make the innards of an automotive battery. A special study took subgroups of five consecutive grids every fifteen minutes and used the internal subgroup variation to estimate short-term (stroke-to-stroke) variation of the machine.

  We can plot the means and ranges of the subgroups on what is called (appropriately enough) an Xbar and R chart. 18 The range chart (bottom panel) shows the short-term consistency; the average chart (top panel) shows the long-term stability. From the average range, by sundry means of the Statistician's Black Arte, which I could teach you, but then I would have to brick you inside the wall of the wine cellar, we can calculate the limits of short-term variation. The chart asks: is there more variation in the long term than there is on the average in the short term?

  Figure 11: An unstable process called a shift. The probability limits on the averages are calculated from the ranges only. That way, long-term instability doesn't affect the limits.

  The long-term variation of paste weights exceeds the limits set by the short-term variation. This means there are assignable causes. A shift took place after the first 12 samples. There is no grand mean, no fit, no model. You can't fit an average to an unstable process.
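
  A minimal sketch of the limit calculation from the average range, on simulated paste weights with a deliberate shift (the constants A2 = 0.577, D3 = 0, D4 = 2.114 are the standard tabulated values for subgroups of five):

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated paste weights: 24 subgroups of five strokes, with a shift after #12.
subgroups = np.vstack([rng.normal(200, 2, (12, 5)),
                       rng.normal(204, 2, (12, 5))])

xbars  = subgroups.mean(axis=1)
ranges = subgroups.max(axis=1) - subgroups.min(axis=1)
xbarbar, rbar = xbars.mean(), ranges.mean()

A2, D3, D4 = 0.577, 0.0, 2.114        # tabulated constants for subgroup size n = 5
ucl_x, lcl_x = xbarbar + A2 * rbar, xbarbar - A2 * rbar
ucl_r, lcl_r = D4 * rbar, D3 * rbar

outside = np.where((xbars > ucl_x) | (xbars < lcl_x))[0]
print(f"Xbar limits from Rbar: {lcl_x:.2f} to {ucl_x:.2f};  R limits: {lcl_r:.2f} to {ucl_r:.2f}")
print("subgroup averages outside the short-term limits:", outside)
# Because the limits come from the within-subgroup ranges only, the long-term
# shift shows up as averages falling outside them.
```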

  A shift differs from a trend. Instead of a cause pushing monotonically over time, a shift means an assignable cause entered the process at a particular time. (Think "new" or "changed.") In this case, the shift at 4:00 P.M. was due to increased density in the PbO paste delivered from the paste mill.

  Figure 12: A shift is not a trend. You could draw a regression line through the time series. You could also jump off a cliff with the lemmings. That doesn't make the linear lemmings right, even if the regression has a Real Small p-Value.™ And you had darned well better not mindlessly project that regression line into the future!

  Closer analysis also identified "jumps" at each hour. The operator tweaked the machine whenever the QC patrol inspector was scheduled to come by! 19

  In process quality control, the important thing is to identify and eliminate causes of variation, not to spank the data until they behave themselves. But in dealing with "ambient" data, we cannot eliminate the cause physically, so we try to filter it analytically to see if it is masking other causes.

  The paste weight example is more complicated than the pH example. The four subgroups between operator tweaks form a block unaffected by either the tweaking or the change in paste density. If we calculate the residuals from the block means and plot them:

  Figure 13: Subtracting the block mean from each datum yields this counterfactual chart of what the process "would have done" had it not been for the two special causes.
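
  A minimal sketch of the block-residual arithmetic, with invented data and block boundaries:

```python
import numpy as np

rng = np.random.default_rng(11)
# Five hourly blocks of four subgroups each; the later blocks sit at a higher
# level to mimic the operator tweaks and the paste-density shift.
block_levels = [200.0, 201.0, 200.5, 204.0, 204.5]
data = np.array([rng.normal(level, 2, 4) for level in block_levels])

block_means = data.mean(axis=1, keepdims=True)
residuals = data - block_means        # variation left after removing each block's level

print(f"std of raw data:             {data.std(ddof=1):.2f}")
print(f"std of block-mean residuals: {residuals.std(ddof=1):.2f}")
# The residuals estimate what the process "would have done" without the special
# causes -- the common-cause variation the machine delivers on its own.
```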

 
