How to Design and Report Experiments
4.3 Confidence Intervals
* * *
We’ve just seen that if we collected lots of samples from the same population and calculated their means then it would be possible to plot the frequency distribution of those means, and by using the standard deviation of this distribution (the standard error) we could get a good idea of how accurately a particular sample mean represents the population. We could take a different approach to assessing the accuracy of sample means and that is to calculate boundaries within which most sample means will fall. Imagine that we collected 100 samples of data regarding the number of units necessary before a man will snog a Labrador, and for each we calculated the mean. From these means (and the resulting sampling distribution) we could calculate the boundaries within which those samples lie. It might be that the lowest mean is 2 units and the highest mean is 18 units. Now, based on this finding we could say that we were 100% confident that the mean of any other samples we draw from that population will also fall within these limits (it will be greater than or equal to 2 units and less than or equal to 18 units). Now usually we’re not interested in 100% confidence and so we will usually, as psychologists, be content with being 95% confident. In this case we could calculate the limits within which 95 of our 100 samples will fall. These limits may be slightly less, say, 3 and 17 units. These boundaries are known as a confidence interval. Typically we look at 95% confidence intervals, and sometimes 99% confidence intervals, but they all have a similar interpretation: they are the limits within which a certain percentage (be that 95% or 99%) of sample means will fall. So, when you see a 95% confidence interval for a mean think of it like this: if we’d collected 100 samples, then 95 of these samples would have a mean within the boundaries of the confidence interval. 
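As a rough illustration of the idea above, the following Python sketch simulates drawing many samples from one population and finds the limits within which the central 95% of their means fall. The population values (a mean of 10 units and a standard deviation of 4), the sample size and the function names are all assumptions made up for the example. It uses 1000 samples rather than 100 simply so that the limits are more stable.

```python
import random
import statistics


def sample_means(pop_mean, pop_sd, sample_size, n_samples, seed=0):
    """Draw n_samples samples from a normal population and return their means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


def central_limits(means, coverage=0.95):
    """Return the (lower, upper) limits that contain the central share of the means."""
    ordered = sorted(means)
    # Number of means to discard from each tail (25 of 1000 for 95% coverage).
    n_drop = round(len(ordered) * (1 - coverage) / 2)
    return ordered[n_drop], ordered[len(ordered) - 1 - n_drop]


if __name__ == "__main__":
    # Hypothetical population: mean 10 units, SD 4 units, 20 people per sample.
    means = sample_means(pop_mean=10, pop_sd=4, sample_size=20, n_samples=1000)
    lower, upper = central_limits(means)
    print(f"95% of sample means fall between {lower:.2f} and {upper:.2f}")
```

Changing the coverage argument to 0.99 gives the wider limits that correspond to a 99% interval.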
The confidence interval can easily be calculated once the standard error is known: the lower boundary of a 95% confidence interval is the mean minus roughly two standard errors (1.96, to be precise, when the sample is large), and the upper boundary is the mean plus the same amount. As such, the mean is always in the centre of the confidence interval.
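The calculation just described can be sketched in a few lines of Python. The scores are hypothetical, the function name is invented for the example, and the standard error is estimated as the sample standard deviation divided by the square root of the sample size (the approximation discussed in Note 3). The multiplier of 2 follows the rule of thumb in the text.

```python
import math
import statistics


def confidence_interval(scores, multiplier=2.0):
    """Return (lower, upper) as the mean minus/plus `multiplier` standard errors."""
    mean = statistics.fmean(scores)
    # Estimated standard error: sample SD (N - 1 denominator) over sqrt(N).
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - multiplier * se, mean + multiplier * se


if __name__ == "__main__":
    units = [8, 9, 10, 11, 12]  # hypothetical units of alcohol for five men
    lower, upper = confidence_interval(units)
    print(f"Approximate 95% CI: {lower:.2f} to {upper:.2f}")
```

Because the same amount is subtracted from and added to the mean, the interval is always symmetrical about it.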
If the mean represents the data well, then the confidence interval of that mean should be small indicating that 95% of sample means would be very similar to the one obtained. If the sample mean is a bad representation of the population then the confidence interval will be very wide indicating that a different sample might produce a mean quite different to the one we have. We’ll talk more about confidence intervals in the next chapter.
4.4 Reporting Descriptive Statistics
* * *
The final thing we need to have a look at is how to report descriptive statistics. Typically, when we do experimental research we’re interested in reporting the mean of one or more samples. We have a choice of reporting this mean either in words or graphically. When we present the mean in words, we usually include information about the accuracy of that mean. So, we might simply report the mean followed by the standard deviation. However, because we’re usually interested in how well the mean represents the population and not how well it represents the sample, we should actually report the standard error (or, less commonly, a confidence interval). In terms of format, most psychology journals have adopted the conventions laid out by the American Psychological Association (or APA for short) in its Publication Manual, which is now in its fifth edition (APA, 2001). Usually, we report any numbers to two decimal places and there are various standard abbreviations for statistical values:
M = Mean
Mdn = Median
SE = Standard Error
SD = Standard Deviation
Let’s have a look at some examples of how to report a mean and the associated standard error in correct APA format. We can simply state the mean within a sentence and parenthesize the standard error:
The mean number of units drunk before snogging Ben the Labrador was 10.00 units (SE = 1.22).
On average the Labradors had to be given 26.65 units of alcohol (SE = 3.42) before they would play tonsil hockey with any of the blokes.
However, it’s more common to parenthesize both the mean and standard error within a sentence that would otherwise make sense even if these parenthesized values were excluded:
Women needed substantially more units of alcohol (M = 17.24, SE = 2.53) than men (M = 10.00, SE = 1.22) before they would exchange saliva with a Labrador.
Although Labradors would lick the participant’s feet after very little alcohol (M = .28, SE = .11), they needed considerably more before they would do the tongue tango with the men (M = 26.65, SE = 3.42).
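If you are producing many such reports, a small helper can keep the formatting consistent. The Python sketch below is only an illustration: the function name is invented, and it simply rounds both values to two decimal places and uses the abbreviations listed above. Note that it keeps the leading zero, so a value such as .28 would print as 0.28.

```python
def apa_mean_se(mean, se):
    """Format a mean and its standard error to two decimal places."""
    return f"M = {mean:.2f}, SE = {se:.2f}"


if __name__ == "__main__":
    # Values taken from the example sentences above.
    print(f"Women needed more units ({apa_mean_se(17.24, 2.53)}) "
          f"than men ({apa_mean_se(10, 1.22)}).")
```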
The second approach is to report means by using a graph and then not report the actual values in the text. The choice of using a graph or writing the values in the text largely depends on what you want to report. If you just have one or two means you might decide that a graph is superfluous, but when you have lots of means to report, a graph can be a very useful way to summarize the data. If you do use a graph then you shouldn’t really report values of the mean in the text (the assumption is that the reader can determine the value of the mean from the graph). There are two types of graph we use to illustrate means: a bar chart and a line chart. In both cases it is very important to include error bars, which are vertical bars indicating the size of the standard error. Figure 4.10 shows an example of a bar chart and a line graph. The bar chart shows the mean units of alcohol that men or women would have to drink before kissing a dog; the gender of the person doing the kissing is displayed as different bars. Each bar represents the mean, but each mean is also spanned by a funny ‘I’ shape; this is called an error bar.
Error bars can represent many things: the standard deviation of the sample, the standard error estimated from the sample, or a confidence interval for the mean. The error bars in Figure 4.10(a) display the 95% confidence interval of the mean of each experimental condition (in this case whether a man or woman was kissing the dog). I explained previously that if we were to take 100 samples from a population and calculate the mean of each, then the confidence interval represents the limits within which 95 of those 100 means would fall. Looking at the confidence interval on the bar labelled male in Figure 4.10 we see it ranges from about 9 to 11, with the mean being about halfway between at 10 units of alcohol. This confidence interval is said to have a lower limit of 9 and an upper limit of 11, and if we took 100 samples from the same population and calculated the mean, 95 of these means would lie between 9 and 11 (the remaining 5 means would lie outside of these limits). An error bar graph, therefore, displays the limits within which the majority of sample means (from the same population) lie. You may notice that the error bar for females is much wider than the bar for males, which indicates that this mean is more variable across samples. So, although men are fairly consistent in the quantity they need to drink before kissing a dog (in 95% of samples it will be between 9 and 11 units), women vary a lot more (in 95% of samples a woman will need between 13 and 21 units). As such, the sample of the men is much more representative of its population.
Figure 4.10 Examples of how to display means using a bar chart (a) and a line graph (b)
Figure 4.10(b) shows a line chart, which shows the average number of units of alcohol that a dog would need to drink before either kissing a man or woman, or licking their feet. Again, the gender of the person being kissed or having their feet licked is shown on the x-axis (the horizontal), but because we have more than one activity (we have foot licking as well as kissing) we can represent these activities as different lines. We have four means in total and each one has an error bar, and we use the lines to connect means that relate to the same activity. This allows us to examine different patterns in the data for different activities. If we look at foot licking first, we can see that dogs require relatively little alcohol before they will lick either a male or a female’s feet (the line is quite flat). This is because dogs are sick and vulgar animals that love to lick smelly feet, unlike cats, which have far more sense and lick only catnip sweets! If we look at kissing then we see that for females there is again relatively little alcohol required, yet to kiss a man, dogs will require a huge amount of alcohol (and who can blame them!). In fact, the average dog would rather lick a man’s feet than kiss him. Nevertheless, this graph illustrates how useful line charts can be to show trends in the data. In fact you’ve just looked at your first interaction graph and we’ll be talking about these kinds of graphs in Chapter 6.
4.5 Summary
* * *
In this chapter we have started to look at how we use samples to tell us things about a larger population. We’ve also seen that we can summarize our data in several ways. First, we can use graphs that show how often each score occurs (frequency distributions) and these graphs often have distinct shapes (they can be symmetrical or skewed, and flat or pointy). Second, we looked at numeric ways to summarize samples of data. These simple models of the sample can take many forms such as the most frequent score (the mode), the middle score in the distribution (the median) and the score that produces the least squared error between the data points and that value (the mean). The mean represents the typical score and because it is a model that uses all of the data collected we need to ascertain how well it fits the data we’ve collected. This is done by looking at the errors between the mean and all of the scores; we square these errors and add them up to give us the sum of squared errors. This value will depend on the number of scores so we can divide it by the number of scores (actually it’s N – 1) collected to give us the mean squared error, or variance. If we take the square root of this value we get the standard deviation, which is a measure of how accurate the mean is; it tells us whether the scores in our sample are all close to the mean (small SD) or very different from the mean (big SD). Finally, we can look at how well the sample mean represents the population by looking at the standard error, which could be calculated by taking lots of samples and calculating their mean and then working out the standard deviation of these sample means.
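The chain of calculations summarized above (sum of squared errors, variance, standard deviation, standard error) can be traced step by step in a short Python sketch. The scores and the function name are hypothetical, and the standard error uses the shortcut of dividing the standard deviation by the square root of the sample size (see Note 3) rather than actually collecting lots of samples.

```python
import math


def describe(scores):
    """Return the mean, sum of squared errors, variance, SD and estimated SE."""
    n = len(scores)
    mean = sum(scores) / n
    # Square each error (score minus mean) and add them up.
    ss = sum((x - mean) ** 2 for x in scores)
    variance = ss / (n - 1)   # mean squared error, using N - 1
    sd = math.sqrt(variance)  # back in the original units
    se = sd / math.sqrt(n)    # estimated standard error of the mean
    return {"mean": mean, "ss": ss, "variance": variance, "sd": sd, "se": se}


if __name__ == "__main__":
    for name, value in describe([2, 4, 4, 6, 9]).items():  # hypothetical scores
        print(f"{name}: {value:.2f}")
```

For the five hypothetical scores the mean is 5, the sum of squared errors is 28 and the variance is 28/4 = 7, which makes it easy to check each step by hand.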
4.6 Practical Tasks
* * *
Now, using the results sections in Chapters 13 and 16, think about the following:
Have the means been presented correctly?
Could the descriptive statistics be presented better?
For each graph write down the mean value in each condition and comment on the error bar (is it big or small and what does this tell us about the mean?).
4.7 Further Reading
Rowntree, D. (1981). Statistics without tears: a primer for non-mathematicians. London: Penguin. Still one of the best introductions to statistical theory (apart from this book obviously!).
Wright, D. B. (2002). First steps in statistics. London: Sage. Chapters 1 and 3 are very accessible introductions to descriptive statistics.
Notes
1 In fact this would be a dangerous conclusion because the sample is still very small. However, Tversky and Kahneman (1971) note that we often have very exaggerated confidence in conclusions based on small samples if everyone in that sample behaves in the same way.
2 The x_i is the observed score for the ith person. The i could be replaced with a number that represents a particular individual or could even be replaced with a name, so if Dan was the sixth amphetamine user then x_i = x_Dan = x_6 = 8.
3 In fact it should be the population standard deviation (σ) that is divided by the square root of the sample size; however, for large samples this is a reasonable approximation.
5 Inferential Statistics
* * *
Describing a single sample and the population from which it came is fine but it doesn’t really help us to answer the types of research questions that we generated in Chapter 2. To answer research questions we typically need to employ inferential statistics – so called because they allow us to infer things about the state of the world (or the human mind if you’re a psychologist).
5.1 Testing Hypotheses
* * *
Scientists are usually interested in testing hypotheses, which put more simply means that they’re trying to test the scientific questions that they generate. Within most research questions there is some kind of inherent prediction that the researcher has made. This prediction is called an experimental hypothesis (it is the prediction that your experimental manipulation will have some effect). Of course, there is always the reverse possibility – that your prediction is wrong and the experiment has no effect – and this is called the null hypothesis. If we look at some of the research questions from Chapter 2 we can see what I mean:
Alcohol makes you fall over: the experimental hypothesis is that those that drink alcohol will fall over more than those that don’t drink alcohol; the null hypothesis is that people will fall over the same amount regardless of whether they have drunk alcohol.
Children learn more from interactive CD-ROM teaching tools than from books: the experimental hypothesis is that learning is better when CD-ROMs are used than when books are used, and the null hypothesis would be that learning is the same regardless of which method is used.
Frustration creates aggression: the experimental hypothesis is that if we frustrate people then their aggression will increase, whereas the null hypothesis is that aggression will remain the same regardless of whether the person is frustrated.
Men and women use different criteria to select a mate: the experimental hypothesis is that if we compare men and women’s criteria for mate selection they will be different, but the null hypothesis would be that males and females use the same criteria.
Depressives have difficulty recalling specific memories: the experimental hypothesis is that depressives cannot recall specific memories as easily as non-depressed people, and the null hypothesis would be that depressed and non-depressed people recall specific memories with equal ease.
Inferential statistics tell us whether the experimental hypothesis is likely to be true. So, these statistical tests help us to confirm or reject our experimental predictions. Of course, we can never be completely sure that either hypothesis is correct, and so we end up working with probabilities. Specifically we calculate the probability that the results we have obtained are a chance result – as this probability decreases, we gain greater confidence that the experimental hypothesis is actually correct and that the null hypothesis can be rejected. We’ve already come across this idea on page 25 where we saw how the tea tasting abilities of an old lady could be tested by setting her increasingly harder challenges. Only when there was a very small probability that she could complete the task by luck alone would we conclude that she had genuine skill in detecting whether milk was poured into a cup before or after the tea was added. In this earlier section, we also mentioned that Fisher suggested that we should use 95% as a threshold for confidence: only when we are 95% certain that a result is genuine (i.e. not a chance finding) should we accept it as being true. The opposite way to look at this is to say that if there is only a 5% probability of something occurring by chance then we can accept that it is a true finding. This criterion of 95% confidence forms the basis of modern statistics and yet there is very little justification for it other than Fisher said so (and to be fair he was a very clever bloke!). Nevertheless, I’ve often wondered how psychology would look today if Fisher had woken up that day in a 90% kind of a mood. Typically research journals have a bias towards publishing positive results (in which the experimental hypothesis is supported) and so just imagine how many theories have been lost because researchers were only 94% confident that the data supported their ideas? 
Had Fisher woken up in a 90% mood that morning, many different theories might have reached public attention and we might have very different models of the human mind. We might have found that cats are the most intelligent beings on this planet, that statistics lecturers are really interesting people who captivate social gatherings with their wit and charm, or that Freud was actually right about castration anxiety – OK, maybe the last one is stretching things too far!
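The tea-tasting logic mentioned above can be made concrete with a small calculation. The Python sketch below assumes one particular version of the task, which is not spelled out in this section: half of the cups had the milk poured first, and the lady has to pick out exactly which ones. The function name is invented for the example. Under that assumption, the probability of a perfect performance by luck alone is one divided by the number of ways of choosing half of the cups.

```python
import math


def chance_of_perfect_guess(n_cups):
    """Probability of picking the correct half of the cups purely by luck."""
    if n_cups < 2 or n_cups % 2:
        raise ValueError("n_cups must be an even number of at least 2")
    # One correct selection out of all the ways to choose half of the cups.
    return 1 / math.comb(n_cups, n_cups // 2)


if __name__ == "__main__":
    for cups in (2, 4, 6, 8):
        print(f"{cups} cups: probability of success by luck = "
              f"{chance_of_perfect_guess(cups):.3f}")
```

With two cups she would succeed by luck half of the time, whereas with six cups the chance drops to 1 in 20, which is the 5% level discussed above.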
Gaining Confidence about Experimental Results
To understand how we can apply Fisher’s ideas about statistical confidence to an experiment let’s stick with the simplest experimental design. In this design you have only two conditions: in one of them you do nothing (the control condition) and in the other you manipulate your causal variable in some way (see Chapters 1 and 3 for more detail). For example, imagine we were interested in whether having a brain affected a person’s ability to give statistics lectures (see Figure 5.1). We could test this using two groups of lecturers: one group has nothing done to them (the control condition), whereas a second group has their brains removed (the experimental group). The outcome we are interested in is the students’ average rating of each lecturer’s skills as a lecturer. If we did nothing to both groups, then they would just be two samples from a population of statistics lecturers. We have already seen that samples from the same population typically have very similar means (see page 132). However, the two sample means might be slightly different because we’d expect to find some variation between the lecturers in the two samples simply because lecturers in the samples will differ in their ability, motivation, intelligence, enthusiasm and more importantly their love of statistics! (You may have noticed that lecturers vary naturally in their lecturing skills.) To sum up, if we were to take two samples of lecturers from the same population we’d expect the sample means to be roughly the same, although some subtle difference will exist. Now, when we talked about sampling distributions earlier (see page 132) we saw that the vast majority of sample means will be fairly similar to the population mean, and only very few samples will be substantially different. As such, if we found that the means in our two samples of lecturers were substantially different this would be a very rare occurrence: it would happen only if, by chance alone, we selected a sample of very good, or very bad, lecturers that were not representative of the population as a whole.
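To get a feel for how rare a substantial difference is when nothing has been done to either group, we can simulate the situation. The Python sketch below repeatedly draws two samples of lecturers from the same population and records the difference between their mean ratings. The population values (a mean rating of 50 with a standard deviation of 10), the group size of 25 and the function names are all assumptions invented for the illustration.

```python
import random
import statistics


def null_mean_differences(pop_mean, pop_sd, group_size, n_experiments, seed=1):
    """Differences between the means of two samples drawn from one population."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = []
    for _ in range(n_experiments):
        group_a = [rng.gauss(pop_mean, pop_sd) for _ in range(group_size)]
        group_b = [rng.gauss(pop_mean, pop_sd) for _ in range(group_size)]
        diffs.append(statistics.fmean(group_a) - statistics.fmean(group_b))
    return diffs


def proportion_at_least(diffs, threshold):
    """Share of simulated experiments whose means differ by at least `threshold`."""
    return sum(abs(d) >= threshold for d in diffs) / len(diffs)


if __name__ == "__main__":
    diffs = null_mean_differences(pop_mean=50, pop_sd=10,
                                  group_size=25, n_experiments=2000)
    for threshold in (1, 4, 8):
        share = proportion_at_least(diffs, threshold)
        print(f"Means differ by {threshold}+ rating points in {share:.1%} of experiments")
```

Small differences turn up in most of the simulated experiments, whereas large ones are very uncommon, which is the pattern described in the paragraph above.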