How to Design and Report Experiments
Box 4.1: Problems with the mode
Several Modes
The picture below shows a bimodal distribution. Notice how there are two humps: this shape is characteristic of bimodal distributions (and camels!). Multimodal distributions might have three or more humps, each one representing a very common score. One classic example of a bimodal distribution is marks on research methods courses. Typically students are either really good or really bad at research methods and so you tend to have a mode around 70% for the people who understand statistics and another one at around 40% representing those who struggle. My aim in life is to eliminate the lower mode so there is just one at the high end of the scale!
Changing scores
Imagine we got 10 people to rate their statistics lecturer on a scale of 1–10 (1 being ‘completely rubbish’ and 10 being ‘I love him’). The scores are as follows:
2, 2, 3, 4, 5, 8, 9, 9, 9, 10
The mode is 9 (it has a frequency of three, which is higher than any other score), so we might conclude that this lecturer is really good because typically he was rated highly. Imagine now that we gave students the opportunity to change their minds and asked them to re-rate the lecturer. The student who rated the lecturer as three the first time around changes his or her mind and decides that because the lecturer tells useless jokes the whole time they’ll rate him as 2. The data become:
2, 2, 2, 4, 5, 8, 9, 9, 9, 10
There are now two modes: 2 and 9. The distribution has become bimodal and is hard to interpret because the modes are at opposite ends of the scale. So we wouldn’t know whether the lecturer is good or bad. Then a latecomer enters the room and she is given the chance to rate the same lecturer. She also thinks the lecturer’s jokes are really unfunny and gives him a 2. So now there is one more score of 2 and the mode would change to 2 and we’d sack the lecturer. However, if that next student had happened to be someone who liked the lecturer and had rated him as 9, then the mode would’ve changed to 9 and we’d have given the lecturer a pay rise. The point is that even a single score can dramatically alter the value of the mode (especially in small samples): it is very unstable.
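The instability of the mode in this example can be checked with a short Python sketch. It assumes Python 3.8 or later, where the standard library provides `statistics.multimode`; the variable names are illustrative and do not come from the text.

```python
from statistics import multimode

# Lecturer ratings from the example above (scale of 1-10).
original = [2, 2, 3, 4, 5, 8, 9, 9, 9, 10]
# One student changes a rating of 3 to a rating of 2.
rerated = [2, 2, 2, 4, 5, 8, 9, 9, 9, 10]

# multimode returns every score that shares the highest frequency.
modes_original = multimode(original)
modes_rerated = multimode(rerated)

# A latecomer adds a single extra rating, either a 2 or a 9.
modes_latecomer_low = multimode(rerated + [2])
modes_latecomer_high = multimode(rerated + [9])

print(modes_original, modes_rerated, modes_latecomer_low, modes_latecomer_high)
```

A single added score is enough to move the mode from one end of the scale to the other, which is the point the box makes.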
Number of hallucinations: 1, 5, 8, 4, 2, 2, 3.
First, arrange these scores into descending order:
Number of hallucinations: 8, 5, 4, 3, 2, 2, 1.
Next, we find the position of the middle score:
Position of the middle score = (n + 1)/2
In this equation n is simply the number of scores we’ve collected. So, for the current example, in which there were seven scores:
(n + 1)/2 = (7 + 1)/2 = 4
Finally, we find the score that is positioned at the location we’ve just calculated. So, in this example we find the 4th score in the ordered list (8, 5, 4, 3, 2, 2, 1), which is 3. The median is therefore 3.
When there is an even number of scores we will have two central locations. If we looked at how bungee jumping affects spine length, we could measure people’s spines before and after they do a jump and calculate the difference in length. If we did this for 16 jumpers our data might look like this (the change in spine length is in millimetres):
9, 10, 7, 20, 57, 15, 19, 30, 37, 24, 2, 0, –4, –6, 16, 11
First, we must arrange these scores in order:
–6, –4, 0, 2, 7, 9, 10, 11, 15, 16, 19, 20, 24, 30, 37, 57
Next, we find the position of the middle score by using the equation above. For the current example, in which there were 16 scores:
(n + 1)/2 = (16 + 1)/2 = 8.5
This means that the median is half way between the 8th and 9th scores. So we simply average these scores. The 8th score in the ordered list was 11 and the 9th score was 15, and the average will be those two scores added together and divided by 2:
Median = (11 + 15)/2 = 13
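The two median calculations (odd and even numbers of scores) can be sketched as one small Python function. This is a minimal illustration that follows the (n + 1)/2 position rule described in the text; the function name `median` and the variable names are assumptions of this sketch, and it expects a non-empty list.

```python
def median(scores):
    """Return the median of a non-empty list using the (n + 1)/2 position rule."""
    ordered = sorted(scores)  # ascending or descending order gives the same median
    n = len(ordered)
    position = (n + 1) / 2  # 1-based position of the middle score
    if position.is_integer():
        # Odd number of scores: one middle score.
        return ordered[int(position) - 1]
    # Even number of scores: average the two scores either side of the position.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


hallucinations = [1, 5, 8, 4, 2, 2, 3]
spine_change = [9, 10, 7, 20, 57, 15, 19, 30, 37, 24, 2, 0, -4, -6, 16, 11]

median_hallucinations = median(hallucinations)
median_spine = median(spine_change)
print(median_hallucinations, median_spine)
```

In practice the standard library's `statistics.median` does the same job; the hand-written version is shown only because it mirrors the steps in the text.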
The median has several advantages: (1) it is relatively unaffected by extreme scores at either end of the scale (outliers), so if someone behaves really oddly it won’t bias this measure too much (see Box 4.3 for more about outliers); (2) if you get lots of scores at one end of the scale (the distribution is skewed – see page 113) then the median is less affected by this than the mean; and (3) it can be used with ordinal, interval and ratio data (it cannot, however, be used with nominal data as these data have no numerical order).
However, there are disadvantages too. On page 111, I mentioned that the behaviours we observe in different samples will differ slightly (and large samples are more reliable), well, the variability of the median across samples can be considerable (and usually greater than the mean – see the next section). As such it isn’t a very stable measure because it is susceptible to sampling fluctuations. Second, the median is not very useful mathematically because it doesn’t take account of all of the scores in the data set.
The Mean
The mean is the sum of all scores divided by the number of scores, so to calculate the mean we simply add up all of the scores and then divide by the total number of scores we have. Algebraically this is represented as (see Box 4.2 for revision of algebra):
x̄ = Σx / n
Let’s take the data on the number of hallucinations for which we calculated the median, and now calculate the mean. First we add up all of the scores:
Σx = 1 + 5 + 8 + 4 + 2 + 2 + 3 = 25
Box 4.2: Revision of Algebra
In the next few chapters you’ll come across several equations. Just so you don’t get lost, here are a few reminders of some symbols that we’ll be using, what these symbols mean, and a few simple rules of mathematics.
Symbols:
Σ, This symbol (called sigma) means ‘add everything up’. So, if you see something like Σx it just means ‘add up all of the scores you’ve collected’.
x̄ is just the symbol for the mean of a set of scores. So, if you see x̄ = 3.26 you should read this as ‘the mean of these data was 3.26’.
s² is just the symbol for the variance of a sample (see the next section). If you see s² = 47.98 you should read this as ‘these data had 47.98 units of variance’. If we are quoting the variance of a population, we use the symbol σ².
s is just the symbol for the standard deviation of a sample (see the next section). If you see s = 5.67 you should read this as ‘the standard deviation of these data was 5.67’. If we are quoting the standard deviation of a population, we use the symbol σ.
The symbol √ means calculate the square root. So, √9 means 3 (because 3 is the square root of 9). There should be a button on your calculator that allows you to work out square roots.
Rules:
Two negatives make a positive: Although in life two wrongs don’t make a right, in mathematics they do! When we multiply a negative number with another negative number, the result is a positive number. For example, –3 × –4 = 12.
A negative number multiplied by a positive one makes a negative number: If you multiply a positive number by a negative number then the result is another negative number. For example, 3 × –4 = –12, or –3 × 4 = –12.
BODMAS: This is an acronym for the order in which mathematical operations are performed. It stands for Brackets, Order, Division, Multiplication, Addition, Subtraction and this is the order in which you should carry out operations within an equation. Mostly these operations are self-explanatory (e.g. always calculate things within brackets first) except for order, which actually refers to power terms such as squares. Three squared, or 3², used to be called three raised to the order of 2, hence why these terms are called order in BODMAS. If we have a look at an example of BODMAS, what would be the result of 1 + 3 × 5²? The answer is 76 (not 100 as some of you might have thought). There are no brackets so the first thing is to deal with the order term: 5² is 25, so the equation becomes 1 + 3 × 25. There is no division, so we can move on to multiplication: 3 × 25, which gives us 75. BODMAS tells us to deal with addition next: 1 + 75, which gives us 76 and the equation is solved. If I’d written the original equation as (1 + 3) × 5², then the answer would have been 100 because we deal with the brackets first: (1 + 3) = 4, so the equation becomes 4 × 5². We then deal with the order term, so the equation becomes 4 × 25 = 100!
http://www.easymaths.com is a good site for revising basic maths.
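The rules in this box can be tried out in Python, which follows the same order of operations. This is only an illustrative sketch: `**` is Python's power operator, and the variable names are made up for this example.

```python
import math

# Order (powers) is evaluated before multiplication, and multiplication before addition.
without_brackets = 1 + 3 * 5 ** 2    # 5 ** 2 = 25, then 3 * 25 = 75, then 1 + 75
with_brackets = (1 + 3) * 5 ** 2     # brackets first: 4 * 25

# Sign rules for multiplication.
negative_times_negative = -3 * -4
positive_times_negative = 3 * -4

# Square root of 9.
root_of_nine = math.sqrt(9)

print(without_brackets, with_brackets)
print(negative_times_negative, positive_times_negative, root_of_nine)
```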
We can then divide by the number of scores (in this case seven):
x̄ = Σx / n = 25/7 = 3.57
The mean is 3.57, which is not a value we observed in our actual data (no-one had 3.57 hallucinations). If we compare this to the median of the same data we can also see that it’s slightly higher (3.57 compared to 3). What might account for this difference? On page 118, I suggested that the median wasn’t heavily affected by extreme scores, well, one disadvantage of the mean is that it can be influenced by extreme scores (see Box 4.3). The other problems with the mean are that it is affected by skewed distributions (see page 113) and can be used only with interval or ratio data.
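The calculation of the mean for the hallucination data can be reproduced in a few lines of Python. This is a minimal sketch with illustrative variable names; it adds up the scores and divides by how many there are, then compares the result with the median from the standard library.

```python
from statistics import median

hallucinations = [1, 5, 8, 4, 2, 2, 3]

total = sum(hallucinations)   # add up all of the scores
n = len(hallucinations)       # number of scores
mean = total / n              # divide the total by the number of scores

middle = median(hallucinations)

print(total, n, round(mean, 2), middle)
```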
Having said that, the mean has several important advantages: at the simplest level the mean uses every score (the mode and median ignore most of the scores in a data set); more important, the mean can be easily manipulated algebraically and so is very useful mathematically; the mean is the most accurate summary of the data (see Box 4.4); and finally the mean is resistant to sampling variation. What I mean by this final point is that if you took one sample from a population and measured the mean, mode and median, and then took a different sample from the same population and again measured the mean, mode and median, then the mean is the most likely of the three measures to be the same in the two samples. The mode and median are more likely to differ across samples than the mean and this is very important because we’re usually using samples to infer something about the entire population – so it’s important that our sample is representative of this population.
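The claim about sampling variation can be explored with a small simulation. The sketch below rests on assumptions that are not in the text: the population is taken to be normally distributed with a mean of 50 and a standard deviation of 10, each sample holds 20 scores, and 2000 samples are drawn. For populations of a very different shape the comparison may come out differently.

```python
import random
from statistics import mean, median, pstdev

rng = random.Random(1)  # fixed seed so that repeated runs give the same samples

# Draw 2000 samples of 20 scores each from an assumed normal population.
samples = [[rng.gauss(50, 10) for _ in range(20)] for _ in range(2000)]

# Compute each statistic in every sample, then see how much it varies across samples.
sample_means = [mean(s) for s in samples]
sample_medians = [median(s) for s in samples]

spread_of_mean = pstdev(sample_means)
spread_of_median = pstdev(sample_medians)

print(spread_of_mean, spread_of_median)
```

Under these assumptions the sample means cluster more tightly than the sample medians, which is what is meant by saying the mean is more resistant to sampling variation.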
Measuring the Accuracy of the Mean
The mean is probably the simplest statistical model that we use (see Field, 2000, Chapter 1). By this I mean that it is a statistic that predicts the likely score of a person (if no other data were available and we were asked to predict a given person’s score then the mean would be the best guess because the error in this guess will, on average, be less than any other guess we might make). Also, the mean is a hypothetical value that can be calculated for any data set, but doesn’t have to be a value that is actually observed in the data. So, it’s a summary statistic. A classic example of what I’m talking about is the statistic, frequently quoted in the UK, that the average family has 2.5 children. If we take this literally, then it means that one family has two full-bodied children and one extra pair of legs, whereas their neighbours also have two full-bodied children but have the added pleasure of a child who exists only from the waist up (you have to pity the family that got the lower half as it won’t do much but produce messy substances!). Alternatively, perhaps the children have been cut down the middle so each has one arm one leg and half a head. Unless I’m doing my usual trick of living in a naïve world, I assume this is not the case and that actually most families have 2 or 3 full-bodied little munchkins. The mean value that is always quoted at us is a hypothetical value. As such, the mean is a model created to summarize our data.
Box 4.3: The effect of outliers
The internet company Amazon sells books, music and videos online and lets users provide reviews and ratings of products they have bought (the rating ranges from 1 to 5 stars). Sad person that I am, I looked up the reviews for my first book (Field, 2000) to see what the feedback was like. At the time of writing, seven people had reviewed and rated my book and their ratings were (in the order the ratings were given): 2, 5, 4, 5, 5, 5, 5. All but one of these ratings are fairly similar (mainly 5 and 4) but the first rating was quite different from the rest – it was a rating of 2. The graph shows these ratings plotted as a graph (so there are the seven raters on the horizontal axis and their ratings on the vertical axis). On this graph there is a horizontal line that represents the mean of all seven scores and it’s clear that all of the scores except one lie close to this line. The score of 2 is very different and lies some way below the mean. This score could be termed an outlier – a score that is very different from the rest, or is inconsistent with the bulk of the data. The dotted horizontal line represents the mean of the scores when the outlier is not included. This line is higher than the original mean indicating that by ignoring this score the mean increases. This shows how a single score can bias the mean; in this case it is dragging the average down.
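If we calculate the mean with and without the outlier we can see the extent of this bias. With the outlier included:
x̄ = (2 + 5 + 4 + 5 + 5 + 5 + 5)/7 = 31/7 = 4.43
If we now calculate the mean without the outlier (remember we now have only 6 scores when we calculate the mean):
x̄ = (5 + 4 + 5 + 5 + 5 + 5)/6 = 29/6 = 4.83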
So, there is a difference in the mean of 0.4. Now if we round off to whole numbers (as Amazon sometimes do), that single score has made a difference between the average rating reported by Amazon being a generally glowing 5 stars and the less impressive 4 stars (not that I’m bitter or anything!).
What happens if we look at the median score? First we must arrange the data in numerical order: 5, 5, 5, 5, 5, 4, 2. With the outlier included there are seven scores (n = 7) so the middle score is in position four, (n + 1)/2 = 4, which gives a median of five. If we remove the outlier there are only six scores (n = 6): 5, 5, 5, 5, 5, 4. With the outlier excluded the middle score is in position (n + 1)/2 = 3.5, which will be the average of positions three and four. Both of these positions contain a score of five and so the average will also be five. So, with the outlier excluded the median is five.
These data are very useful in illustrating the effects of an outlier. The mean is always biased by outliers (in this case the outlier reduced the mean). The median, however, is not so prone to bias, and often a single extreme score will have little or no effect on its value.
(Data for this example from http://www.amazon.co.uk/)
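The comparison in this box can be reproduced with Python's standard `statistics` module. This is an illustrative sketch using the seven ratings quoted above; the variable names are assumptions of the example.

```python
from statistics import mean, median

ratings = [2, 5, 4, 5, 5, 5, 5]                     # ratings in the order given
without_outlier = [r for r in ratings if r != 2]    # drop the single rating of 2

mean_with = mean(ratings)                  # 31 / 7
mean_without = mean(without_outlier)       # 29 / 6
median_with = median(ratings)
median_without = median(without_outlier)

print(round(mean_with, 2), round(mean_without, 2))
print(median_with, median_without)
```

The mean shifts by about 0.4 when the outlier is removed, whereas the median stays at 5 in both cases.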
It’s important with any model that it accurately represents the state of the real world. Engineers, before constructing a building, will build many scaled-down models of the building and test them (they stick them in wind tunnels to make sure they don’t fall over and that sort of thing). It’s crucial that these models behave in the same way as the real building will, otherwise the tests on the model become a pointless exercise (because they are inaccurate). A direct parallel can be drawn to what is done in statistics: imagine the population is the building we were just discussing, and the sample is a scaled-down version of the population that we use to build models (like the scaled-down models of the building). It’s important that these models are accurate, otherwise the inferences we draw about the population (or building) will be inaccurate.
In the previous section I mentioned that the mean was the most accurate model of a data set that we could have. However, it will only ever be a perfect representation of the data if all of the scores we collect are the same. Figure 4.5 demonstrates this point using two data sets representing ratings of two books – ‘obsessions and compulsions’ by Dr. Didi Check (circles) and ‘matrix algebra is fun, honestly’ by Dr. Ted Dius (triangles). Dr. Check’s book has been given five star ratings from all seven customers, and the mean is, therefore, five. You’ll notice that the line representing this mean passes through every rating showing that this mean is a perfect reflection of every point in the data set. However, Dr. Dius’ book has much more variable ratings and the line that represents the mean of these data only passes through two of the seven scores. For the remaining five scores there is a difference between the observed rating and the one predicted by the mean. For these ratings the mean is not a perfect model. If we wanted to work out how representative the mean is of the data, the simplest thing to do would be to look at these differences between the raw scores and the mean. When the mean is a perfect fit of the data these differences are all zero (there is no difference between the mean and each data point). However, unless you test a bunch of clones you’ll never collect a set of psychological data in which all the scores are the same (and if you do you should be worried!). There will always be some differences between the mean (which is typically not a value that actually exists in the raw scores) and the raw scores. It’s fine to have these differences but if the mean is a very good representation of the data then these differences will be small.
Figure 4.5 Graph showing a set of data for which the mean is a perfect fit (circles) and another set of data for which the mean is an imperfect fit (triangles)
Figure 4.6 shows the number of hallucinations that each of seven amphetamine users had in a day, and also the mean number that we calculated earlier on. The line representing the mean is our model, and the circles are the observed data. The diagram also has a series of vertical lines that connect each observed value to the mean. These lines represent the differences between what our model predicts (the mean) and the data we actually collected; these differences are known as deviations. The magnitude of these differences is calculated simply by subtracting the mean value (x̄) from each of the observed values (xᵢ). For example, the first amphetamine user had only 1 hallucination and so the difference is x₁ – x̄ = 1 – 3.57 = –2.57. Notice that the difference is a minus number, which tells us that our model overestimates the number of hallucinations that this user had: it predicts that he had 3.57 hallucinations when in reality he had 1. The simplest way to use these deviations to estimate the accuracy of the model would be to add them (this would give us an estimate of the total error). If we were to do this we would find (Table 4.1) that the total deviations add up to zero:
Σ(xᵢ – x̄) = 0
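The deviations for the hallucination data can be computed directly. The sketch below uses Python's `fractions.Fraction` so that the mean is held exactly as 25/7; this is a choice made for the illustration, because working with the rounded value 3.57 would leave a small rounding remainder of 0.01 in the total.

```python
from fractions import Fraction

hallucinations = [1, 5, 8, 4, 2, 2, 3]

# Exact mean: 25/7 (about 3.57).
mean = Fraction(sum(hallucinations), len(hallucinations))

# Deviation of each observed score from the mean (observed minus mean).
deviations = [x - mean for x in hallucinations]
total_deviation = sum(deviations)

first_deviation = round(float(deviations[0]), 2)  # first user: 1 - 3.57

print([round(float(d), 2) for d in deviations])
print(total_deviation)
```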
Figure 4.6 A graph showing the differences between the number of hallucinations experienced by each amphetamine user and the mean
Table 4.1
The fact that the total of all deviations is zero might seem to indicate that there is no error within the model (that the mean is a perfect fit), but we know that it is not a perfect fit because the scores deviate from the mean. The sum of all deviations equals zero because the mean is a measure of the centre of the distribution: the positive deviations of the scores above the mean are exactly offset by the negative deviations of the scores below it, so when we add them up they cancel each other out, giving us a total of zero. To overcome this problem we square each of the deviations (Table 4.2) so that they are all positive (the negative ones will become positive when squared because you are multiplying two negative numbers – see Box 4.2).
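The cancellation can also be written out algebraically. The derivation below is a standard result rather than something taken from the tables in the text. The second line applies the squaring step to the seven hallucination scores; its value was worked out for this illustration from the raw scores and the exact mean of 25/7, not copied from Table 4.2.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = \sum_{i=1}^{n} x_i - n \cdot \frac{\sum_{i=1}^{n} x_i}{n}
  = 0

\sum_{i=1}^{7} (x_i - \bar{x})^2
  = \sum_{i=1}^{7} x_i^2 - \frac{\left(\sum_{i=1}^{7} x_i\right)^2}{7}
  = 123 - \frac{25^2}{7}
  = \frac{236}{7} \approx 33.71
```

The first line holds for any set of scores, so the sum of the raw deviations can never tell us how well the mean fits; the sum of the squared deviations is zero only when every score equals the mean.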