by Andy Field
Table 4.2
The mean is actually the value that produces the lowest total of squared deviations, so of all the values we could choose to summarize a data set, the mean is the one that gives rise to the least error (see Box 4.4). However, although it is the value that produces the least error, the error it produces can still be substantial, which is why we use the deviations between the raw scores and the mean to gauge the accuracy of the mean as a model of the data. The sum of squared deviations is more commonly referred to as the sum of squared errors (or SS for short). In itself, the sum of squared errors is a good measure of the accuracy of the mean. However, it has one obvious problem – can you think what it might be?
Box 4.4: Why is the mean the best model of the data?
All statistical models have some kind of error: that is, they make a prediction about a person’s score that will, in most cases, differ somewhat from the actual score that the person obtained. If we use the mean as a statistical model then we’re predicting that a person’s score is equal to the mean. Say we took 20 cats and measured the number of catnip sweets they could eat before falling asleep (catnip is pretty much the feline equivalent of cannabis!), and the mean turned out to be 34 sweets. If someone then asked ‘how many sweets did Fuzzy eat before falling asleep?’ and we didn’t have the raw scores, then our model (the mean) predicts that Fuzzy ate 34 sweets before passing out. However, this prediction could be wrong: Fuzzy may have eaten only 27 sweets (an error of 7 sweets), or he might have eaten 35 sweets (an error of 1). The mean is the best model of a data set because it is designed to be the value that produces the least error. So, if we want a model of the data that uses all of the observed scores (remember that the median and mode do not use all scores) then the mean is the best model we can use. However, just because the mean produces the least error, this is not to say that the mean is necessarily a good model; it just produces less error than any other value we might choose as a model.
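For readers who like to see the algebra behind this claim, here is a short sketch using a little calculus (which the box itself doesn’t assume). If we call our candidate model value $c$, the total squared error it produces is

$$SS(c) = \sum_{i=1}^{N}(x_i - c)^2 .$$

Differentiating with respect to $c$ and setting the result to zero gives

$$\frac{d\,SS}{dc} = -2\sum_{i=1}^{N}(x_i - c) = 0 \quad\Rightarrow\quad c = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x},$$

so the value that minimizes the total squared error is the mean (and because the second derivative, $2N$, is positive, this really is a minimum, not a maximum).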
Let’s look at an example that demonstrates how the mean is always the value that produces the least error. If we use the hallucination data from the main text, the mean was 3.57. Now, what error would we get if we used a different value as our model? Say we used two values very close to the mean, such as 3.56 and 3.58, and one fairly dissimilar value, such as 4 (see Table 4.3). We could look at the squared deviations for these three models and compare them to what we get when we use the mean.
Table 4.3
Notice that when we used the mean (3.57), the total squared deviation was 33.7143 (see Table 4.2). When we use a value just below the mean (3.56), the total squared deviation increases slightly to 33.7152. When we use a value just above the mean (3.58), the total squared deviation is still slightly higher than when we use the mean (33.7148 compared to 33.7143), and when we use a value quite far from the mean (4), the total squared deviation increases a fair bit (from 33.7143 to 35).
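If you want to verify this pattern yourself, here is a minimal sketch in Python. The scores below are made-up stand-ins rather than the actual data from Table 4.2; the point is simply that the sum of squared deviations is smallest when the candidate model value equals the mean.

```python
# Minimal sketch: the sum of squared deviations is smallest at the mean.
# The scores below are hypothetical stand-ins, not the actual data from Table 4.2.
scores = [1, 2, 2, 3, 4, 6, 7]

def sum_squared_errors(data, model_value):
    """Total squared deviation between each score and a candidate model value."""
    return sum((x - model_value) ** 2 for x in data)

mean = sum(scores) / len(scores)
for candidate in (mean - 0.01, mean, mean + 0.01, 4):
    print(f"model = {candidate:.4f}, SS = {sum_squared_errors(scores, candidate):.4f}")
```

Whatever scores you put in, the printed SS will be lowest for the mean and will grow as the candidate value moves away from it.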
Imagine that we asked another seven amphetamine users about their hallucinations: what would happen to the sum of squared deviations? Well, we’d have twice as many squared deviations as before, so when we added them all up the total would probably be about twice as large (and even if not twice as large, it would certainly be bigger). So, the problem with the sum of squared errors as a measure of accuracy is that it gets bigger the more values we have in the data set. Although it would be meaningful to compare the current sum of squared errors with that found in, say, seven people who didn’t use amphetamines, it wouldn’t make sense to compare it with a group of a different size. To overcome this problem, we need to take account of the number of scores on which the sum of squared deviations is based. The obvious way to do this is simply to divide by the number of scores (N). This in effect gives us the mean squared error. If we are interested only in the average error for the sample, then we can divide by N alone (see Box 4.6). However, we are generally interested in using the error in the sample to estimate the error in the population, and so we divide the SS by the number of observations minus 1 (the reason why is explained in Box 4.5). This measure is known as the variance:
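$$\text{variance} = s^2 = \frac{SS}{N-1} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1} = \frac{33.7143}{6} = 5.62$$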
The variance is the average squared deviation between the mean and an observed score, and so it tells us, on average, how much a given data point differs from the mean of all data points. The variance is an incredibly useful measure because we can compare it across samples that contain different numbers of observations; however, it still has a minor problem: it is measured in units squared (because we squared each deviation during the calculation). As such, we would have to say that the average error in our data was 5.62 hallucinations-squared. It may be ridiculous to talk about 5.62 hallucinations, but it makes even less sense to talk about hallucinations-squared (how do you square a hallucination?). For this reason, we often convert the average error back into the original units of measurement by taking the square root of the variance; this measure is known as the standard deviation (Box 4.6). In this example the standard deviation is:
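$$s = \sqrt{s^2} = \sqrt{5.62} = 2.37$$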
The SS, variance and standard deviation are all measures of the same thing: the accuracy of the mean, or the variability of the data (bearing in mind that the mean will be most accurate when the scores are all similar, and most inaccurate when the scores are very different). These measures are all proportionate to each other: a large SS (relative to the number of scores) will result in a large variance, which will result in a large standard deviation. It’s important to remember in the forthcoming chapters that sums of squares, variances and standard deviations all measure the same thing. If the standard deviation is a measure of how well the mean represents the data, then small standard deviations (relative to the value of the mean itself) indicate that data points are close to the mean, and large standard deviations (relative to the mean) indicate that the data points are distant from the mean (i.e., the mean is not an accurate representation of the data).
Imagine we collect information about hallucinations from a new sample of seven amphetamine users. Figure 4.7 shows their data. The average number of hallucinations for this new sample is the same as for our old sample (3.57); however, in this new sample the scores are much more tightly packed around the mean value (compare Figure 4.7 with Figure 4.6 to see this difference). The standard deviation of this new sample turns out to be 0.53, compared to 2.37 in our old sample. It should be clear from the two graphs that the mean better represents the new sample (in which all values are close to the mean) than it does the original sample (in which values vary much more around the mean). This accuracy is reflected in the standard deviation, which is small (compared to the mean) for the new sample and large (compared to the mean) for the old sample. This illustration hopefully clarifies why the standard deviation tells us how well the mean represents the data.
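If you want to play with this idea yourself, here is a minimal Python sketch. The two samples below are hypothetical (they are not the data plotted in Figures 4.6 and 4.7); they simply illustrate that two samples can share a mean while having very different standard deviations.

```python
# Minimal sketch: two samples can share a mean but differ greatly in spread.
# Both samples below are hypothetical illustrations, not the data in the figures.
from statistics import mean, stdev  # stdev uses the N - 1 divisor

old_sample = [1, 1, 2, 3, 5, 6, 7]   # scores vary a lot around the mean
new_sample = [3, 3, 3, 4, 4, 4, 4]   # scores packed tightly around the mean

for label, sample in (("old", old_sample), ("new", new_sample)):
    print(f"{label} sample: mean = {mean(sample):.2f}, SD = {stdev(sample):.2f}")
```

Both samples have the same mean, but the standard deviation of the tightly packed sample is far smaller, which is exactly what makes the mean a better model of it.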
Box 4.5: Why do we use N – 1?
When we calculate the variance (the average squared error) we don’t divide the sum of squared errors by the number of observations (N), as you typically would when calculating an average. Instead, we divide by N – 1. The simple reason is that we are using the sample to estimate the variance in the population. In fact, if we wanted to calculate the variance of the sample and weren’t interested in the population at all, then we could divide by N. So, why does the fact that we’re using the sample to estimate the variance in the population mean that we have to use N – 1?
Let’s begin to answer this question using an analogy. Now, I sometimes spend my Saturday afternoons playing rugby (badly!) for Brighton third XV. On a typical Saturday the captain, despite his best efforts to structure a team, will usually have to wait for people to turn up before deciding which positions to put them in. There are 15 positions in rugby (union) and each requires different skills. When the first player arrives, the captain has the choice of 15 positions in which to place him. This player might be a burly 22 stone muscle man, so the captain decides that he should be a prop in the scrum (a good position for burly men) and places him in position 1; therefore, one position on the pitch is now occupied. Say I arrived next (I’m keen), the captain still has the choice of 14 positions in which he can play me – he has the freedom to choose a position. Now, I’m kind of puny, weak, skinny and crap, so he might decide to put me on the wing because the only thing I’m good at is running away from burly 22 stone men (and if I was put in the scrum my back would snap!), so that’s another position gone – there are only 13 left. As more players arrive, the captain will reach the point at which 14 of the 15 positions have been filled. When the final player arrives the captain has no freedom to choose where he (or she) plays – there is only one position left. So, we can say there are 14 degrees of freedom; that is, for 14 players you have some degree of choice over where they play, but for 1 player you have no choice. The degrees of freedom are one less than the number of players.
How does this analogy relate to samples and populations? Well, if we take a sample of five observations from a population, then these five scores are free to vary in any way – they can be any value (e.g. 5, 6, 2, 9, 3). Now, when we use samples to infer things about the population from which they came, the first thing we do is assume that the sample mean is the same as the population mean (so, for these data we fix the population mean at 5). As such, we hold one parameter constant. With this parameter fixed, can all five scores now vary? The answer is no, because to keep the mean constant only four values are free to vary. For example, if we collected five new scores and the first four were 5, 7, 1 and 8, then the final value must be 4 to ensure that the mean equals the value at which we fixed it (5). Therefore, if we hold one parameter constant then the degrees of freedom must be one less than the sample size (N – 1). So, in statistical terms, the degrees of freedom relate to the number of observations that are free to vary.
Box 4.6: Standard deviations in the population
Something a lot of students get confused about is whether we should use N or N – 1 in the equation for the standard deviation. I mentioned in the main text that if we’re calculating the standard deviation for a sample (and we’re not interested in generalizing from our sample to the entire population) then we can calculate the standard deviation (the square root of the average squared error) by dividing by the number of scores we collected:
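$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N}}$$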
However, if we want to use our sample to estimate the standard deviation of the population (and in psychology this is nearly always what we want to do), then we have to use the number of data points we collected minus 1:
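$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1}}$$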
Why do we use N – 1? Well, the reason is the same as for the degrees of freedom (see Box 4.5). That is, to estimate the standard deviation of the population we have to first estimate the mean of that population by using the sample mean. If we fix the population mean at a certain value then only N – 1 scores will be free to vary (because the last score will have to take on the value that brings the mean to the value at which we’ve fixed it – see Box 4.5). Therefore, because we fix the population mean to estimate the population standard deviation, we have only N – 1 scores that are free to vary and so one score, because it cannot vary, must be excluded from the calculation.
Figure 4.7 Number of hallucinations for seven amphetamine users (circles)
The Standard Deviation and the Shape of the Distribution
Not only do the variance and standard deviation tell us about the accuracy of the mean as a model of our data set, they also tell us something about the shape of the distribution of scores. If you think about what we’ve learnt about the mean, we know that if the mean represents the data well, most of the scores will cluster close to the mean (and the standard deviation will be small relative to the mean); as such, the distribution of scores will be quite thin. When data are more variable, they will be spread further away from the mean and so the distribution of scores will be fatter (scores distant from the mean will occur more frequently than when the standard deviation is small). Figure 4.8 illustrates this point by showing two distributions that have the same mean (50) but different standard deviations. On the left-hand side, the distribution has a standard deviation of 20 and this results in a flatter distribution that is more spread out (scores distant from the mean occur with a reasonable frequency). The distribution on the right has a lower standard deviation (15) and this results in a slightly more pointy distribution in which scores close to the mean are very frequent but scores further from the mean become increasingly infrequent. The main point to note from the figures is that as the standard deviation gets larger, the distribution gets fatter.
Figure 4.8 Two distributions that have the same mean but different standard deviations
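If you’d like to draw something in the spirit of Figure 4.8 yourself, here is a minimal Python sketch. It assumes you have numpy, scipy and matplotlib installed (the book itself doesn’t use Python); it simply plots two normal curves with the same mean (50) but standard deviations of 20 and 15.

```python
# Minimal sketch: two normal curves with the same mean but different spreads,
# in the spirit of Figure 4.8. Assumes numpy, scipy and matplotlib are installed.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

x = np.linspace(0, 100, 500)
plt.plot(x, norm.pdf(x, loc=50, scale=20), label="mean = 50, SD = 20 (flatter)")
plt.plot(x, norm.pdf(x, loc=50, scale=15), label="mean = 50, SD = 15 (more pointy)")
plt.xlabel("Score")
plt.ylabel("Density")
plt.legend()
plt.show()
```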
The Standard Error: How Well Does My Sample Represent the Population?
So far we’ve learnt how we can summarize a set of data using the mean, and that we can assess the accuracy of that mean using the standard deviation. This, in itself, is very useful because it tells us whether our sample mean is representative of the scores within the sample. However, at the beginning of this chapter we talked about how scientists use samples to discover what’s going on in a population (to which they don’t have access). As such, the next step on from looking at how well the mean represents the sample is to look at how well the sample represents the population.
When I was talking about the advantages of the mean as a summary of a sample, I mentioned that the mean was resistant to sampling variation; that is, if we were to take different samples from the same population, these samples would usually have fairly similar means. I’m now going to come back to this idea. When someone takes a sample from a population, they are taking one of many possible samples. If we were to take several samples from the same population, then each sample would have its own mean, and some of these sample means would be different (not every sample will have the same mean).
Figure 4.9 shows the process of taking samples from a population. Imagine we were interested in how many units of alcohol it would take a man before he would snog a Labrador called Ben. Just suppose that in reality, if we tested every man on the planet, we’d find that it takes them 10 units on average (about 5 pints of lager) before they would snog dear old Ben. Of course, we can’t test everyone in the population, so we use a sample. Actually, to illustrate the point we take 9 samples (shown in the diagram). For each of these samples we can calculate the average, or sample mean. As you can see in the diagram, some of the samples have the same mean as the population and some have different means: the first sample of men will snog Ben after an average of 10 units, but the second sample will do the deed after an average of only 9. We can plot the sample means as a frequency distribution, just as I have done in the diagram. This distribution shows that there were three samples that had a mean of 10, means of 9 and 11 occurred in two samples each, and means of 8 and 12 occurred in only one sample each. The end result is a nice symmetrical distribution known as a sampling distribution. A sampling distribution is simply the frequency distribution of sample means from the same population. In theory we’d take hundreds or thousands of samples to construct a sampling distribution, but I’m just using 9 to keep the diagram simple!

OK, so the sampling distribution tells us about the behaviour of samples from the population, and you’ll notice that it is centred at the same value as the mean of the population (i.e., 10). This means that if we took the average of all sample means we’d get the value of the population mean. Now, if the average of the sample means is the same value as the population mean, then knowing the accuracy of that average would tell us something about how likely it is that a given sample is representative of the population. To work out the accuracy of the mean of the sample means we could again just look at the sampling distribution: if it is very spread out, then sample means tend to vary a lot and can be quite different from one another, whereas if the distribution is quite thin, then most sample means are very similar to each other (and to the population mean). If we wanted a value to represent how accurate a sample is likely to be, then we can simply calculate the standard deviation of the sampling distribution (that is, the standard deviation of sample means). In the example in which we took 9 samples, the standard deviation of sample means is 1.22, and this value is known as the standard error of the mean (SE). The standard error is simply the standard deviation (or variability) of sample means. A large value tells us that sample means can be quite different from each other and, therefore, a given sample may not be representative of the population. It could be that we just happened to pick a sample that is full of people who will snog a Labrador when they are sober, or a sample of people who need to drink copious amounts before they’d even think about exchanging saliva with a canine. Small values of the standard error tell us that the sample is likely to be a fair reflection of the population (because sample means are all very similar), and so although it is still possible that we happen to have got a sample of Labrador-snogging weirdos, it is much less likely (because these extreme samples are not very common).
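If you want to watch a sampling distribution emerge for yourself, here is a minimal simulation sketch in Python. The ‘population’ below is an invented normal population with a mean of 10 (echoing the units-before-snogging-Ben example); the population standard deviation, sample size, number of samples and seed are all arbitrary choices, not values from the text.

```python
# Minimal sketch: building a sampling distribution of the mean by simulation.
# The population here is invented (normal, mean 10, SD 3); all numbers are
# arbitrary illustrative choices, not values from the text.
import random

random.seed(1)
sample_size = 25
n_samples = 1000

# Draw many samples from the 'population' and record each sample mean.
sample_means = []
for _ in range(n_samples):
    sample = [random.gauss(10, 3) for _ in range(sample_size)]
    sample_means.append(sum(sample) / sample_size)

# The standard deviation of these sample means is the standard error.
grand_mean = sum(sample_means) / n_samples
se = (sum((m - grand_mean) ** 2 for m in sample_means) / (n_samples - 1)) ** 0.5
print(f"mean of sample means = {grand_mean:.2f} (population mean is 10)")
print(f"standard error (SD of sample means) = {se:.2f} (theory: 3/sqrt(25) = 0.60)")
```

The average of the sample means comes out very close to the population mean, and their standard deviation (the standard error) comes out close to the population standard deviation divided by the square root of the sample size, which is exactly the approximation introduced below.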
Figure 4.9 Diagram showing how a sampling distribution is created
Of course, in reality we cannot collect hundreds of samples, calculate their means, and then calculate the standard deviation of those sample means, so we rely on an approximation of the standard error. Fortunately, there are people who love nothing more than doing really hard statistical stuff, and they have come up with ways in which the sample standard deviation can be used to approximate the standard error. We don’t need to understand why this approximation works (thank goodness!); we can just trust that these people are ridiculously clever and know what they’re talking about. The standard error can be calculated by dividing the sample standard deviation (s) by the square root of the sample size (N):
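$$SE = \frac{s}{\sqrt{N}}$$

To take the hallucination data as an example, a standard deviation of 2.37 based on 7 scores would give a standard error of $2.37/\sqrt{7} \approx 0.90$.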