How Much More Different? Enter the Normal Distribution
Everyone has heard the phrase normal distribution or bell-shaped curve, or, as in the title of a controversial book, bell curve. They all refer to a common way that natural phenomena arrange themselves approximately (the true normal distribution is a mathematical abstraction that is never perfectly observed in nature). If you look again at the distribution of high school boys that opened the discussion, you will see the makings of a bell curve. If we added several thousand more boys to it, the kinks and irregularities would smooth out, and it would actually get close to a normal distribution. A perfect one looks like the one in the figure below.
It makes sense that most things will be arranged in bell-shaped curves. Extremes tend to be rarer than the average. If that sounds like a tautology, it is only because bell curves are so common. Consider height again. Seven feet is “extreme” for humans. But if human height were distributed so that equal proportions of people were five feet, six feet, and seven feet tall, the extreme would not be rarer than the average. It just so happens that the world hardly ever works that way.
Bell curves (or close approximations to them) are not only common in nature, they have a close mathematical affinity to the meaning of the standard deviation. In any true normal distribution, no matter whether the elements are the heights of basketball players, the diameters of screw heads, or the milk production of cows, 68.3 percent of all the cases fall in the interval between one standard deviation above the mean and one standard deviation below it.
In its mathematical form, the normal distribution extends to infinity in both directions, never quite reaching the horizontal axis. But for all practical purposes, when we are talking about populations of people, a normal distribution is about six standard deviations wide. The numbers below the axis in the figure above designate the number of standard deviations above and below the mean. As you can see, the curve has virtually touched the horizontal axis at ±3 standard deviations.
A person who is one standard deviation above the mean in IQ is at the 84th percentile. Two standard deviations above the mean puts him at the 98th percentile. Three standard deviations above the mean puts him at the 99.9th percentile. A person who is one standard deviation below the mean is at the 16th percentile. Two standard deviations below the mean puts him at the 2nd percentile. Three standard deviations below the mean puts him at the 0.1th percentile.
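These percentile figures come straight from the normal cumulative distribution function, and readers who want to check them can do so in a few lines of code. The sketch below is a minimal illustration, assuming Python with the scipy library is available; any routine for the normal CDF would serve just as well.

```python
# Sketch: checking the coverage and percentile figures above against
# the normal CDF. Assumes scipy; any normal-CDF routine would do.
from scipy.stats import norm

# Share of cases within +/-1, +/-2, and +/-3 standard deviations of the mean.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within +/-{k} SD: {coverage:.1%}")   # 68.3%, 95.4%, 99.7%

# Percentile rank of a person k standard deviations above or below the mean.
for z in (1, 2, 3, -1, -2, -3):
    print(f"z = {z:+d}: {norm.cdf(z):.1%}")      # 84.1%, 97.7%, 99.9%, 15.9%, ...
```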
Why Not Just Use Percentiles to Begin With?
Why go to all the trouble of computing standard scores? Most people understand percentiles already. Tell them that someone is at the 84th percentile, and they know right away what you mean. Tell them that he’s at the 99th percentile, and they know what that means. Aren’t we just introducing an unnecessary complication by talking about “standard scores”?
Thinking in terms of percentiles is convenient and has its legitimate uses. I often speak in terms of percentiles in the text. But they can also be highly misleading, because they are artificially compressed at the tails of the distributions. It is a longer way from, say, the 98th percentile to the 99th than from the 50th to the 51st. In a true normal distribution, the distance from the 99th percentile to the 100th (or, similarly, from zero to the 1st) is infinite.
Consider two people who are at the 50th and 55th percentiles in height. Using a large representative sample from the National Longitudinal Survey of Youth (NLSY) as our estimate of the national American distribution of height, their actual height difference is only half an inch. Consider another two people who are at the 94th and 99th percentiles in height—the identical gap in terms of percentiles. Their height difference is 3.1 inches, six times the height difference of those at the 50th and 55th percentiles. The farther out on the tail of the distribution you move, the more misleading percentiles become.
Standard scores reflect these real differences much more accurately than do percentiles. The people at the 50th and 55th percentiles, only half an inch apart in real height, have standard scores of 0.0 and 0.13. Compare that difference of 0.13 standard deviation units to the standard scores of those at the 94th and 99th percentiles: 1.55 and 2.33 respectively. In standard scores, their difference—which is 0.78 standard deviation units, equivalent to an effect size of 0.78—is six times as large, reflecting the sixfold difference in inches.
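The same arithmetic can be run in reverse, from percentiles back to standard scores, using the inverse of the normal CDF. The sketch below reproduces the comparison just described; it again assumes scipy, and it works from the idealized normal distribution rather than the NLSY heights, so the last digit differs slightly from the rounded figures in the text.

```python
# Sketch: converting percentiles back to standard scores with the
# inverse normal CDF (norm.ppf), to show the compression at the tails.
from scipy.stats import norm

z50, z55 = norm.ppf(0.50), norm.ppf(0.55)   # 0.00 and 0.13
z94, z99 = norm.ppf(0.94), norm.ppf(0.99)   # 1.55 and 2.33

print(f"50th to 55th percentile: {z55 - z50:.2f} SD")  # 0.13
print(f"94th to 99th percentile: {z99 - z94:.2f} SD")  # about 0.77-0.78,
                                                       # roughly six times as large
```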
Correlation and Regression
So much for describing a distribution of measurements. We now need to consider dealing with the relationships between two or more distributions—which is, after all, what scientists usually want to do. How, for example, is the pressure of a gas related to its volume? The answer is the Boyle’s law you learned in high school science. In social science, the relationships between variables are less clear-cut and harder to unearth. We may, for example, be interested in wealth as a variable, but how shall wealth be measured? Is it yearly income, yearly income averaged over a period of years, the value of one’s savings or possessions? And wealth, compared to many of the other things social scientists study, is easy, reducible as it is to dollars and cents.
Beyond the problem of measurement, social scientists must cope with sheer complexity. It is rare that any human or social relationship can be fully captured in terms of a single pair of variables, such as that between the pressure and volume of a gas. In social science, multiple relationships are the rule, not the exception.
For both of these reasons, the relations between social science variables are typically less than perfect. They are often weak and uncertain. But they are nevertheless real, and with the right methods, they can be rigorously examined.
Correlation and regression are the primary ways to quantify weak, uncertain relationships. For that reason, the advances in correlational and regression analysis since the late nineteenth century have provided much of the impetus for modern social science. To understand what this kind of analysis is, we need to introduce the idea of a scatter plot.
A Scatter Plot
We left your male high school classmates lined up by height, with you looking down from the rafters. Now imagine another row of cards, laid out along the floor at a right angle to the ones for height. This set of cards has weights in pounds on them. Start with 90 pounds for the class shrimp and continue to add cards in 10-pound increments until you reach 250 pounds to make room for the class giant. Now ask your classmates to find the point on the floor that corresponds to both their height and weight (perhaps they’ll insist on a grid of intersecting lines extending from the two rows of cards). When the traffic on the gym floor ceases, you will see something like the following figure.
Some sort of relationship between height and weight is immediately obvious. The heaviest boys tend to be the tallest, the lightest ones the shortest, and most of them are intermediate in both height and weight. Equally obvious are the deviations from the trend that link height and weight. The stocky boys appear as points above the mass, the skinny ones as points below it. What we need now is some way to quantify both the trend and the exceptions.
Correlations and regressions accomplish this in different ways. But before we go on to discuss these terms, be assured that they are simple. Look at the scatter plot. You can see just by looking at the dots that as height increases, so does weight, in an irregular way. Take a pencil (literally or imaginarily) and draw a straight, sloping line through the dots in a way that seems to you to best reflect this upward-sloping trend. Now continue to read and see how well you have intuitively produced the basis for a correlation coefficient and a regression coefficient.
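If you would rather not conscript your classmates, a few lines of code can stage the same exercise. The sketch below draws a scatter plot from simulated heights and weights; the numbers are invented for illustration (tuned so that the correlation comes out near .50), not taken from the NLSY.

```python
# Sketch: a scatter plot like the gym-floor exercise, using simulated
# heights and weights. All numbers are made up for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
height = rng.normal(68, 2.7, 250)                    # inches
weight = 3.9 * height - 110 + rng.normal(0, 18, 250) # pounds; noisy upward trend

plt.scatter(height, weight, s=12)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.show()
```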
The Correlation Coefficient
Modern statistics provides more than one method for measuring correlation, but we confine ourselves to the one that is most important in both use and generality: the Pearson product-moment correlation coefficient (named after Karl Pearson, the English mathematician and biometrician). To get at this coefficient, let us first replot the graph of the class, replacing inches and pounds with standard scores. The variables are now expressed in general terms. Remember: Any set of measurements can be transformed similarly.
The next step on our way to the correlation coefficient is to apply a formula that finds the best possible straight line passing through the cloud of points—the mathematically “best” version of the line you just drew by intuition.
What makes it the “best”? Any line is going to be wrong for most of the points. Take, for example, the boys who are 64 inches tall and look at their weights. Any sloping straight line is going to cross somewhere in the middle of those weights and will probably not cross any of the dots exactly. For boys 64 inches tall, you want the line to cross at the point where the total amount of the error is as small as possible. Taken over all the boys at all the heights, you want a straight line that makes the sum of all the errors (more precisely, the sum of the squared errors, which is why the method is known as least squares) as small as possible. This “best fit” is shown in the new version of the scatter plot shown below, where both height and weight are expressed in standard scores and the mathematical best-fitting line has been superimposed.
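Here is the criterion in code: a sketch, using the same simulated data as before, that computes the total squared error for several candidate slopes through the standardized points. The best-fitting slope is the one that makes this sum smallest.

```python
# Sketch: what makes the least-squares line "best." Compare the sum of
# squared errors for several candidate slopes through simulated,
# standardized data (invented numbers, as before).
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(68, 2.7, 250)
weight = 3.9 * height - 110 + rng.normal(0, 18, 250)

# Standardize both variables (convert them to standard scores).
zh = (height - height.mean()) / height.std()
zw = (weight - weight.mean()) / weight.std()

def sum_sq_err(slope):
    # For standardized data the best line passes through (0, 0),
    # so the prediction is simply slope * zh.
    return ((zw - slope * zh) ** 2).sum()

for slope in (0.25, 0.50, 0.75, 1.00):
    print(f"slope {slope:.2f}: total squared error = {sum_sq_err(slope):.1f}")
# The minimum lands near slope 0.50 -- which, as discussed below, is the
# correlation coefficient.
```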
This scatter plot has (partly by serendipity) many lessons to teach about how statistics relate to the real world. Here are a few of the main ones:
1. Notice the many exceptions. There is a statistically substantial relationship between height and weight, but, visually, the exceptions seem to dominate. So too with virtually all statistical relationships in the social sciences, most of which are much weaker than this one.
2. Linear relationships don’t always seem to fit very well. The best-fit line looks as if it is too shallow—notice the tall boys and see how consistently the line underpredicts how much they weigh. Given the information in the diagram, this might be an optical illusion—many of the dots in the dense part of the range are on top of each other, as it were, and thus it is impossible to grasp visually how the errors are adding up—but it could also be that the relationship between height and weight is not linear.
3. Small samples have individual anomalies. Before we jump to the conclusion that the straight line is not a good representation of the relationship, we must remember that the sample consists of only 250 boys. An anomaly of this particular small sample is that one of the boys weighed 250 pounds. Eighteen-year-old boys are rarely that heavy; judging from the entire NLSY sample, only one or two per 10,000 reach that weight. And yet one of those rarities happened to be picked up in a sample of 250. That’s the way samples work.
4. But small samples are also surprisingly accurate, despite their individual anomalies. The relationship between height and weight shown by the sample of 250 18-year-old males is identical to the third decimal place with the relationship among all 6,068 males in the NLSY sample (the correlation coefficient is .501 in both cases). This is closer than we have any right to expect, but other random samples of only 250 generally produce correlations that are within a few points of the one produced by the larger sample. (There is a body of mathematics for figuring out what “generally” and “within a few points” mean, but we needn’t worry about it here; the simulation sketched after this list makes the point concretely.)
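The simulation below makes point 4 concrete. It builds a large simulated “population” with a correlation near .50 (invented numbers, not the NLSY data), then draws a thousand samples of 250 and reports the range within which most of the sample correlations fall.

```python
# Sketch: how much sample correlations based on n = 250 wobble around
# the population value. The "population" is simulated, with a
# correlation set near .50 by construction.
import numpy as np

rng = np.random.default_rng(0)
n_pop = 100_000
height = rng.normal(68, 2.7, n_pop)
weight = 3.9 * height - 110 + rng.normal(0, 18, n_pop)
print("population r:", round(np.corrcoef(height, weight)[0, 1], 3))

draws = []
for _ in range(1000):
    idx = rng.choice(n_pop, size=250, replace=False)
    draws.append(np.corrcoef(height[idx], weight[idx])[0, 1])

# The middle 95 percent of the sample correlations:
print("typical range:",
      round(np.percentile(draws, 2.5), 2), "to",
      round(np.percentile(draws, 97.5), 2))
# Most samples of 250 fall within several hundredths of the population value.
```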
Bearing these basics in mind, let us go back to the sloping line in the figure above. Out of mathematical necessity, we know several things about it. First of all, it must pass through the intersection of the zeros (which, in standard scores, correspond to the averages) for both height and weight. Secondly, the line would have had exactly the same slope had height been the vertical axis and weight the horizontal one.
Finally, and most significant, the slope of the best-fitting line cannot be steeper than 1.0. The steepest possible best-fitting line, in other words, is one along which one unit of change in height is exactly matched by one unit of change in weight, clearly not the case in these data. Real data in the social sciences never yield a slope that steep. Note that while the line in the graph above goes uphill to the right, it would go downhill for pairs of variables that are negatively correlated.
We focus on the slope of the best-fitting line because it is the correlation coefficient—in this case, equal to .50, which is quite large by the standards of variables used by social scientists. The closer it gets to ±1.0, the stronger is the linear relationship between the standardized variables (the variables expressed as standard scores). When the two variables are mutually independent, the best-fitting line is horizontal; hence its slope is 0. Anything other than 0 signifies a relationship, albeit possibly a very weak one.
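The claim that the slope is the correlation coefficient can be checked directly in code. The sketch below, using the same simulated data as before, fits the best line through the standardized variables and compares its slope with the Pearson coefficient computed the usual way.

```python
# Sketch: the slope of the least-squares line through standardized data
# equals the Pearson correlation coefficient (simulated data, as before).
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(68, 2.7, 250)
weight = 3.9 * height - 110 + rng.normal(0, 18, 250)

zh = (height - height.mean()) / height.std()
zw = (weight - weight.mean()) / weight.std()

slope = np.polyfit(zh, zw, 1)[0]       # best-fitting line, standardized data
r = np.corrcoef(height, weight)[0, 1]  # Pearson correlation coefficient
print(round(slope, 3), round(r, 3))    # the two numbers agree
```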
Whatever the correlation coefficient of a pair of variables is, squaring it yields another notable number. Squaring .50, for example, gives .25. The significance of the squared correlation is that it tells us how much the variation in weight would decrease if we could make everyone the same height, or vice versa. If all the boys in the class were the same height, the variation in their weights would decline by 25 percent. Perhaps you have heard the phrase “explains the variance,” as in, for example, “Education explains 20 percent of the variance in income.” That figure comes from the squared correlation.
In general, the squared correlation is a measure of the mutual redundancy in a pair of variables. If they are highly correlated, they are highly redundant in the sense that knowing the value of one of them places a narrow range of possibilities for the value of the other. If they are uncorrelated or only slightly correlated, knowing the value of one tells us nothing or little about the value of the other.
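A sketch of the squared correlation at work, again with the simulated data: regress weight on height, and the variance of what is left over shrinks to about 1 − r² of the original, which is just another way of saying that r² of the variance has been “explained.”

```python
# Sketch: the squared correlation as the share of variance "explained."
# Simulated data, as in the earlier sketches.
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(68, 2.7, 250)
weight = 3.9 * height - 110 + rng.normal(0, 18, 250)

r = np.corrcoef(height, weight)[0, 1]
slope, intercept = np.polyfit(height, weight, 1)
residuals = weight - (slope * height + intercept)

# Variance left over after accounting for height, as a share of the total:
print(round(residuals.var() / weight.var(), 3))  # approximately 1 - r**2
print(round(1 - r**2, 3))
```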
Regression Coefficients
Correlation assesses the strength of a relationship between variables. But we may want to know more about a relationship than merely its strength. We may want to know what it is. We may want to know how much of an increase in weight, for example, we should anticipate if we compare 66-inch boys with 73-inch boys. Such questions arise naturally if we are trying to explain a particular variable (e.g., annual income) in terms of the effects of another variable (e.g., educational level). “How much income is another year of schooling worth?” is just the sort of question that social scientists are always trying to answer.
The standard method for answering it is regression analysis, which has an intimate mathematical association with correlational analysis. If we had left the scatter plot with its original axes—inches and pounds—instead of standardizing them, the slope of the best-fitting line would have been a regression coefficient, rather than a correlation coefficient. For example, the regression coefficient for weight regressed on height tells us that for each additional inch in height, we can expect an increase of 3.9 pounds. Or we could regress height on weight and discover that each additional pound of weight is associated with an increase of .065 inches in height.
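These two coefficients are not arbitrary: each equals the correlation multiplied by the ratio of the two standard deviations, so their product is the squared correlation. The sketch below checks this with the simulated data; my invented numbers give slopes near the 3.9 and .065 quoted in the text, but only by construction.

```python
# Sketch: regression in original units. The slope of y on x equals
# r * (SD of y / SD of x), so regressing each variable on the other
# gives two different coefficients whose product is r squared.
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(68, 2.7, 250)
weight = 3.9 * height - 110 + rng.normal(0, 18, 250)

b_w_on_h = np.polyfit(height, weight, 1)[0]  # pounds per additional inch
b_h_on_w = np.polyfit(weight, height, 1)[0]  # inches per additional pound
r = np.corrcoef(height, weight)[0, 1]

print(round(b_w_on_h, 2), round(b_h_on_w, 3))         # near 3.9 and .065
print(round(b_w_on_h * b_h_on_w, 3), round(r**2, 3))  # these match
```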
Multivariate Statistics
Multiple regression analysis is the main way of dealing with the multiple relationships that are the rule in social science. To get a fix on multiple regression, let us return to the high school gym for the last time. Your classmates are still scattered about the floor. Now imagine a pole, erected at the intersection of 60 inches and 80 pounds, marked in inches from 18 inches to 50 inches. For some inscrutable reason, you would like to know the impact of both height and weight on a boy’s waist size. Since imagination can defy gravity, you ask each boy to levitate until the soles of his shoes are at the elevation that reads on the pole at the waist size of his trousers. In general, the taller and heavier boys must rise the most, the shorter and slighter ones the least, and most boys, middling in height and weight, will have middling waist sizes as well. Multiple regression is a mathematical procedure for finding that plane, slicing through the space in the gym, that minimizes the aggregated squared distances (in this instance, along the waist-size axis) between the bottoms of the boys’ shoes and the plane.
The best-fitting plane will tilt upward toward heavy weights and tall heights. But it may tilt more along the pounds axis than along the inches axis or vice versa. It may tilt equally for each. The slope of the tilt along each of these axes is again a regression coefficient. With two variables predicting a third, as in this example, there are two coefficients. One of them tells us how much of an increase in trouser waist size is associated with a given increase in weight, holding height constant; the other, how much of an increase in trouser waist size is associated with a given increase in height, holding weight constant.
With two variables predicting a third, we reach the limit of visual imagination. But the principle of multiple regression can be extended to any number of variables. Income, for example, may be related not just to education, but also to age, family background, IQ, personality, business conditions, region of the country, and so on. The mathematical procedures will yield coefficients for each of them, indicating again how much of a change in income can be anticipated for a given change in any particular variable, with all the others held constant.
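For those who want to see the procedure itself, the sketch below fits such a plane by least squares, using two predictors and simulated waist sizes. The coefficients, like all the numbers in these sketches, are invented for illustration; only the method is the point.

```python
# Sketch: multiple regression with two predictors, fitting the tilted
# plane described above by least squares. Waist sizes are simulated
# from made-up coefficients.
import numpy as np

rng = np.random.default_rng(0)
n = 250
height = rng.normal(68, 2.7, n)                     # inches
weight = 3.9 * height - 110 + rng.normal(0, 18, n)  # pounds
waist = 0.25 * height + 0.08 * weight + rng.normal(0, 1.5, n)  # inches

# Design matrix: a column of ones (the intercept) plus the two predictors.
X = np.column_stack([np.ones(n), height, weight])
coefs, *_ = np.linalg.lstsq(X, waist, rcond=None)
intercept, b_height, b_weight = coefs

# b_height: extra waist inches per inch of height, holding weight constant.
# b_weight: extra waist inches per pound, holding height constant.
print(round(b_height, 2), round(b_weight, 3))
```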
Adapted from The Bell Curve: Intelligence and Class Structure in American Life by Richard J. Herrnstein and Charles Murray. Copyright © 1994 by Richard J. Herrnstein and Charles Murray. Reprinted with permission of The Free Press, a Division of Simon & Schuster Trade Publishing Group. I have made a few minor changes to the original text, eliminating some material specific to issues in The Bell Curve and rewording a few sentences to fit the context of Human Diversity.
Appendix 2
Sexual Dimorphism in Humans
Homo sapiens is a normally dimorphic species consisting overwhelmingly of heterosexual males and females with a small proportion of exceptions. I realize that I am writing in the LGBT era when some argue that 63 distinct genders have been identified. But while that opening statement constitutes fighting words in some circles, it is not scientifically controversial. If you are already convinced that human beings are a normally dimorphic species, you may want to skip this appendix.