by Andy Field
Therefore, the null hypothesis is highly unlikely.
If we go back to a football example, we could get a similar statement:
• If a man plays football then he probably doesn’t play for England (this is true because there are thousands of people who play football and only a handful make it to the dizzy heights of the England squad!). Phil Neville plays for England.
Therefore, Phil Neville probably doesn’t play football.
Now, although at first glance this seems perfectly logical (Phil Neville certainly doesn’t play football in the conventional sense of the term), it is actually completely ridiculous – the conclusion is wrong because Phil Neville is a professional footballer (despite his best attempts to prove otherwise!). This illustrates a common fallacy in hypothesis testing. In fact, hypothesis testing allows us to say very little about the null hypothesis.
Type II error: This is when we believe that our experimental manipulation has failed, when in reality it hasn’t. This would occur when, in the real world, the experimental manipulation does have an effect, but we obtain a small test statistic (perhaps because there is a lot of natural variation between our samples). So, the sample means appear to be quite similar even though the experiment has made a difference to the samples. In an ideal world, we want the probability of this error to be very small (if the experiment does have an effect then it’s important that we can detect it). Cohen (1992) suggests that the maximum acceptable probability of a Type II error would be .2 (or 20%) – this is called the β-level. That would mean that if we took 100 samples for which the experiment had genuinely had an effect then we would fail to find an effect in 20 of those samples (so we’d miss 1 in 5 genuine effects).
There is obviously a trade-off between these two errors: if we lower the probability of accepting an effect as genuine (make α smaller) then we increase the probability that we’ll reject an effect that does genuinely exist (because we’ve been so strict about the level at which we’ll accept that an effect is genuine). The exact relationship between the Type I and Type II error is not straightforward because they are based on different assumptions: a Type I error assumes there is no effect in the population whereas a Type II error assumes there is an effect. So, although we know that as the probability of making a Type I error decreases, the probability of making a Type II error increases, the exact nature of the relationship is usually left for the researcher to make an educated guess (Howell, 2001, pp. 104–107 gives a great explanation of the trade-off between errors).
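To make this trade-off concrete, here is a minimal simulation sketch (in Python, with invented numbers; the text itself works with SPSS, so this is purely illustrative). It repeatedly runs a two-group experiment and counts how often an independent t-test comes out significant, once when the null hypothesis is really true and once when a genuine effect exists. Lowering α from .05 to .01 cuts the Type I error rate, but it also cuts the power and so inflates the Type II error rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 5000, 25

def rejection_rate(mean_diff, alpha):
    """Proportion of simulated experiments whose t-test is significant at alpha."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        experimental = rng.normal(mean_diff, 1.0, n_per_group)
        _, p = stats.ttest_ind(control, experimental)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

for alpha in (0.05, 0.01):
    type_i = rejection_rate(mean_diff=0.0, alpha=alpha)  # null hypothesis is true
    power = rejection_rate(mean_diff=0.5, alpha=alpha)   # a genuine effect exists
    print(f"alpha={alpha}: Type I rate ~ {type_i:.3f}, "
          f"power ~ {power:.3f}, Type II rate ~ {1 - power:.3f}")
```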
Effect Sizes
The framework for testing hypotheses that I’ve just presented has a few problems, most of which have been briefly explained in Box 5.2. There are several remedies to these problems. The first problem we encountered was knowing how important an effect is: just because a test statistic is significant doesn’t mean that the effect it measures is meaningful or important. The solution to this criticism is to measure the size of the effect that we’re testing. When we measure the size of an effect (be that an experimental manipulation or the strength of relationship between variables) it is known as an effect size. An effect size is simply an objective and standardized measure of the magnitude of the observed effect. The fact that the measure is standardized just means that we can compare effect sizes across different studies that have measured different variables, or have used different scales of measurement (so an effect size based on speed in milliseconds could be compared to an effect size based on heart rates). Many measures of effect size have been proposed, but the most common one is Pearson’s correlation coefficient (see Field, 2001). Many of you will be familiar with the correlation coefficient as a measure of the strength of relationship between two variables; however, it is also a very versatile measure of the strength of an experimental effect. It can be a bit difficult to reconcile how the humble correlation coefficient can also be used in this way, but this is only because students are typically taught about it within the context of non-experimental research. The reason why the correlation coefficient can be used to measure the size of experimental effects is easily understood once you’ve read Box 5.1, in which I explain that the proportion of total variance in the data that can be explained by the experiment is equal to r². Because r² is a proportion, it must lie between 0 (meaning the experiment explains none of the variance at all) and 1 (meaning that the experiment can explain all of the variance); the bigger the value, the bigger the experimental effect. So, we can compare different experiments in an objective way: by comparing the proportion of total variance for which they can account. If we take the square root of this proportion, we get the Pearson correlation coefficient, r, which is also constrained to lie between 0 (no effect) and 1 (a perfect effect).1
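To make the link between r and an experimental effect concrete, here is a small illustrative sketch (in Python, with invented scores; it is not part of the original text, which works with SPSS). Coding group membership as 0 and 1 and correlating it with the outcome gives r for the experimental effect, and squaring it gives the proportion of variance the experiment explains.

```python
from scipy import stats

# Hypothetical outcome scores for two experimental conditions
control      = [12, 14, 11, 15, 13, 12, 14]
experimental = [16, 18, 15, 19, 17, 16, 18]

# Code group membership (0 = control, 1 = experimental) and correlate it with the outcome
group   = [0] * len(control) + [1] * len(experimental)
outcome = control + experimental
r, p = stats.pearsonr(group, outcome)

print(f"effect size r = {r:.2f}")     # strength of the experimental effect
print(f"r squared     = {r**2:.2f}")  # proportion of total variance explained
```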
The useful thing about effect sizes is that they provide an objective measure of the importance of the experimental effect. So, it doesn’t matter what experiment has been done, what outcome variables have been measured, or how the outcome has been measured: we know that a correlation coefficient of 0 means the experiment had no effect, and a value of 1 means that the experiment completely explains the variance in the data. What about the values in between? Luckily, Cohen (1988, 1992) has made some widely accepted suggestions about what constitutes a large or small effect:
• r = 0.10 (small effect): in this case the effect explains 1% of the total variance.
• r = 0.30 (medium effect): the effect accounts for 9% of the total variance.
• r = 0.50 (large effect): the effect accounts for 25% of the variance.
We can use these guidelines to assess the importance of our experimental effects (regardless of the significance of the test statistic). However, r is not measured on a linear scale, so an effect with r = .6 isn’t twice as big as one with r = .3: an effect of r = .6 explains 36% of the variance, four times as much as the 9% explained by r = .3. Such is the utility of effect size estimates that the American Psychological Association is now recommending that all psychologists report these effect sizes in the results of any published work. So, it’s a habit well worth getting into.
A final thing to mention is that when we calculate effect sizes we calculate them for a given sample. Now, when we looked at means in a sample we saw that we used them to draw inferences about the mean of the entire population (which is the value in which we’re actually interested). The same is true of effect sizes: the size of the effect in the population is the value in which we’re interested, but because we don’t have access to this value, we use the effect size in the sample to estimate the likely size of the effect in the population (see Field, 2001).
Statistical Power
We’ve seen that effect sizes are an invaluable way to express the importance of a research finding. The effect size in a population is intrinsically linked to three other statistical properties: (1) the sample size on which the sample effect size is based; (2) the probability level at which we will accept an effect as being statistically significant (the α level); and (3) the power of the test to detect an effect of that size. As such, once we know three of these four quantities, we can always calculate the remaining one. These calculations also depend on whether the test is one- or two-tailed (see Box 5.3). Typically, in psychology we use an α level of .05 (see earlier) so we know this value already. The power of a test is the probability that a given test will find an effect assuming that one exists in the population. If you think back to page 151 you might remember that we’ve already come across the probability of failing to detect an effect when one genuinely exists (β, the probability of a Type II error). It follows that the probability of detecting an effect if one exists must be the opposite of the probability of not detecting that effect (i.e. 1 – β). We saw on page 152 that Cohen (1988, 1992) suggests that we would be prepared to accept a .2 probability of failing to detect a genuine effect, and so the corresponding level of power that he recommended was 1 – .2, or .8. In other words, we should aim to achieve a power of .8, or an 80% chance of detecting an effect if one genuinely exists. The effect size in the population can be estimated from the effect size in the sample, and the sample size is determined by the experimenter anyway, so that value is easy to calculate. Now, there are two useful things we can do knowing that these four variables are related:
Box 5.3: Tests have tails
When we conduct a statistical test we do one of two things: (1) test a specific hypothesis such as ‘when people are depressed they eat more chocolate’, or (2) test a non-specific hypothesis such as ‘men and women differ in the amount of chocolate they eat when they’re depressed’. The former example is directional: we’ve explicitly said that people eat more chocolate when depressed (i.e. the mean amount of chocolate eaten when depressed is more than the mean amount eaten when not depressed). If we tested this hypothesis statistically, the test would be known as a one-tailed test. The second hypothesis is non-directional: we haven’t stated whether men or women eat more chocolate when they’re depressed, we’ve just said that men will be different from women. If we tested this hypothesis statistically, the test would be known as a two-tailed test.
Figure 5.3 Diagram to show the difference between one- and two-tailed tests
Imagine we wanted to discover whether men or women ate more chocolate when depressed. We took two groups: one group of depressed males and one group of depressed females. We then measured the amount of chocolate they both ate. If we have no directional hypothesis then there are three possibilities: (1) depressed men eat more chocolate than depressed women, therefore the mean for men is larger than the mean for women and so the difference (mean for men minus the mean for women) is positive; (2) depressed men eat less chocolate than depressed women, therefore the mean for men is smaller than the mean for women and so the difference (mean for men minus the mean for women) is negative; (3) there is no difference between depressed men and women, their means are the same and so the difference (mean for men minus the mean for women) is exactly zero. This final option is the null hypothesis. The sign of the test statistic (i.e. whether it is positive or negative) will depend on whether the difference is positive or negative. Assuming we do get a difference, we have to take account of the fact that the mean for males could be bigger than for females (giving a positive test statistic) but that the mean for males could also be smaller (giving a negative test statistic). If, at the .05 level, we need a test statistic bigger than, say, 5, what do you think would happen if the test statistic turned out to be negative? Well, put simply, the result would be deemed non-significant even though a difference does exist, because we looked only for a large positive value. To avoid this we have to look at both ends (or tails) of the distribution of possible test statistics. This means that to keep our criterion probability of .05 we have to split this probability across the two tails: so we have .025 at the positive end of the distribution and .025 at the negative end. Figure 5.3 shows this situation – the grey areas are the areas above the test statistic needed at a .025 level of significance. Combine the probabilities at both ends and we get .05 – our criterion value. If, however, we have made a prediction, then we basically put all our eggs in one basket. So, we might say that we predict that men eat more chocolate than women, and so we’re only interested in finding a positive difference – we only look at the positive end of the distribution. This means that we can just look for the value of the test statistic that would occur by chance with a probability of .05. Figure 5.3 shows this situation too – the area containing diagonal lines is the area above the test statistic needed at a .05 level of significance. The point to take home is that if we make a one-tailed prediction then we need a smaller test statistic to find a significant result (because we are looking in only one tail).
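To put rough numbers on Figure 5.3, the sketch below (Python; an illustration that assumes a standard normal test statistic, which is not something the box specifies) shows the critical values at α = .05: the two-tailed test needs a value of about 1.96 because only .025 sits in each tail, whereas the one-tailed test needs only about 1.64.

```python
from scipy import stats

alpha = 0.05

# One-tailed: all of alpha sits in a single tail of the distribution
one_tailed_crit = stats.norm.ppf(1 - alpha)        # ~1.64

# Two-tailed: alpha is split into .025 in each tail
two_tailed_crit = stats.norm.ppf(1 - alpha / 2)    # ~1.96

print(f"one-tailed critical z: {one_tailed_crit:.2f}")
print(f"two-tailed critical z: {two_tailed_crit:.2f}")
```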
• Calculate the power of a test: Given that we’ve conducted our experiment, we will have already selected a value of α, we can estimate the effect size based on our sample, and we will know how many participants we used. Therefore, we can use these values to calculate 1 – β, the power of our test (a rough sketch of this calculation is given after this list). If this value turns out to be .8 or more we can be confident that we achieved sufficient power to detect any effects that might have existed, but if the resulting value is less, then we might want to replicate the experiment using more participants to increase the power.
• Calculate the sample size necessary to achieve a given level of power: Given that we know the values of α and β, we can use past research to estimate the size of effect that we would hope to detect in an experiment. Even if no one has previously done the exact experiment that we intend to do, we can still estimate the likely effect size based on similar experiments. We can use this estimated effect size to calculate how many participants we would need to detect that effect (based on the values of α and β that we’ve chosen).
The most common use is the latter: to determine how many participants should be used to achieve the desired level of power. The actual computations are very cumbersome, but fortunately, there are now computer programs available that will do them for you (one example being nQuery Adviser – see Field, 1998b for a review). Also, Cohen (1988) provides extensive tables for calculating the number of participants for a given level of power (and vice versa). Based on Cohen (1992) we can use the following guidelines: if we take the standard α level of .05 and require the recommended power of .8, then we need 783 participants to detect a small effect size (r = .1), 85 participants to detect a medium effect size (r = .3) and 28 participants to detect a large effect size (r = .5).
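As a rough illustration of both calculations (this sketch is not from the original text, and it uses the Fisher z approximation for a correlation-type effect size rather than the exact routines in programs such as nQuery Adviser), the Python below estimates the sample size needed for a given effect size, α and power, and the power achieved for a given sample size. With α = .05 (two-tailed) and power = .8 it reproduces Cohen’s figures of 783 and 85 for small and medium effects, and comes within a couple of participants of the 28 quoted for a large effect (the approximation is rougher for large r).

```python
import math
from scipy import stats

def fisher_z(r):
    """Fisher's z transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def required_n(r, alpha=0.05, power=0.8):
    """Approximate sample size needed to detect an effect of size r (two-tailed)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / fisher_z(r)) ** 2 + 3)

def achieved_power(r, n, alpha=0.05):
    """Approximate power of a two-tailed test of an effect of size r with n participants."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return 1 - stats.norm.cdf(z_alpha - fisher_z(r) * math.sqrt(n - 3))

for r in (0.1, 0.3, 0.5):
    print(f"r = {r}: roughly {required_n(r)} participants needed for power .8")

print(f"power with r = .3 and n = 50: {achieved_power(0.3, 50):.2f}")
```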
5.2 Summary
* * *
This chapter has been a bit of a crash course in the theory of inferential statistics (hopefully crash is not too ominous a word to use!). We saw that essentially we can partition the variance in our scores into two sorts: (1) variance we can explain with our experimental manipulation or manipulations, and (2) variance we can’t explain. Most test statistics essentially compare the variance explained by a given effect against the variance in the data that it doesn’t explain. This led us to consider how we might gain confidence that a particular experimental manipulation has explained enough variance for it to reflect a genuine effect. Setting levels for this confidence involves a trade-off between believing in effects that are in reality untrue (a Type I error) and missing effects that are in fact genuine (a Type II error). We discussed the importance of quantifying effects in a standard form using effect sizes, and how in turn we could use effect sizes to educate us about the power of our statistical tests and the sample size that we might need to achieve appropriate levels of power to detect an effect.
5.3 Practical Tasks
* * *
Some study questions:
What is an effect size and why is it important?
Should we always rely on a .05 level of significance to determine whether we believe an effect is genuine or not?
What can we conclude from a significance test?
5.4 Further Reading
All of the following are lucid accounts of the main issues in this chapter, by one of the best statistics writers we’ve had in psychology.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45 (12), 1304–1312.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112 (1), 155–159.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49 (12), 997–1003.
Notes
1 The correlation coefficient can also take on negative values (but not below –1); this doesn’t matter here because when these values are squared the minus signs cancel out and the result is positive. The sign is useful when we’re measuring a relationship between two variables because the sign of r tells us about the direction of the relationship (see Field, 2000, pp. 74–75), but in experimental research the sign of r merely reflects the way in which the experimenter coded their groups (see Field, 2000, pp. 93–95).
6 Parametric Statistics
* * *
In the last chapter, we looked at why we use inferential statistics. The next two chapters run through the most common inferential statistical tests used in the social sciences. For each test there’ll be a brief rationale of how the test works (but luckily not a lot of maths!); what kind of output you can expect to find from SPSS (the most popular statistical package in the social sciences); how you should interpret that output (so, what are the key features of the output to look out for?); and how to report the findings in APA format. This chapter concentrates on parametric tests, which are tests that are constrained by certain assumptions (I’ll discuss these shortly), whereas the following chapter describes tests that do not depend upon these assumptions (non-parametric tests).
6.1 How Do I Tell If My Data are Parametric?
* * *
Parametric tests work on the arithmetic mean, and so data must be measured at an interval or ratio level (see page 8); otherwise the mean just doesn’t make sense. This requirement has to be looked at subjectively and with a bit of common sense. In psychology we rarely have the luxury of knowing the level at which we’ve measured (see Box 1.1), but we can take an educated guess. Parametric tests also make assumptions about the variances between groups or conditions. When we use different participants the assumption is basically that the variance in one experimental condition is roughly the same as the variance in any other experimental condition. So, if we took the data in each experimental condition and calculated the variance, we would expect all of the values to be roughly the same. This is called the assumption of homogeneity of variance. This assumption makes intuitive sense because we saw in Chapter 4 that the variance is a measure of the accuracy of the mean. If we’re using tests that compare means in different conditions then we would like these means to be equivalently accurate. A similar assumption in repeated measures designs is the sphericity assumption (which I’ll discuss in detail on page 183). The assumptions of homogeneity of variance and sphericity can be easily tested using Levene’s test (see page 165) and Mauchly’s test (see page 184) respectively. However, you should bear in mind that the power of these tests depends upon the sample size (see page 154) and so they can’t always be trusted – especially in small samples.
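As a concrete illustration (a sketch in Python with invented scores; the text itself refers to SPSS output for Levene’s test), homogeneity of variance for two conditions could be checked like this: a significant result (p < .05) suggests the variances differ, bearing in mind the caveat above about the test’s own power.

```python
from scipy import stats

# Hypothetical scores from two experimental conditions
condition_a = [8, 9, 7, 10, 9, 8, 11, 9]
condition_b = [4, 12, 6, 15, 3, 14, 5, 13]

# Levene's test: the null hypothesis is that the group variances are equal
statistic, p_value = stats.levene(condition_a, condition_b)

print(f"Levene's W = {statistic:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Variances appear to differ (homogeneity of variance is doubtful).")
else:
    print("No evidence against homogeneity of variance.")
```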