by Andy Field
Example: Imagine Twaddle and Sons, the publishers of ‘Women are from Bras, men are from Penis’, were upset about my claims that their book was about as useful as a paper umbrella. They decided to take me to task and design their own experiment in which participants read their book, and this book (Field & Hole) at different times. Relationship happiness was measured after reading each book. To maximize their chances of finding a difference they used a sample of 500 participants, but got each participant to take part in both conditions (they read both books). The order in which books were read was counterbalanced (see Chapter 3) and there was a delay of 6 months between reading the books. They predicted that reading their wonderful contribution to popular psychology would lead to greater relationship happiness than reading some dull and tedious book about experiments.
SPSS Output for the Dependent t-Test
SPSS Output 6.3 shows the output from a dependent t-Test. The first of the three tables produces descriptive statistics; for each condition we are told the mean, the number of participants (N) and the standard deviation of the sample. In the final column we are told the standard error (this is calculated in the same way as for the independent t-test – see the previous section). The next table reports the Pearson correlation between the two conditions. When repeated measures are used it is possible that scores from the two experimental conditions will correlate (because the data in each condition come from the same people and so there should be some constancy in their responses). SPSS provides the value of Pearson’s r and the two-tailed significance value. For these data the experimental conditions yield a fairly weak correlation coefficient (r = .117) but it is highly statistically significant because p < .01. I shall come back to this point later.
SPSS Output 6.3
The final table is the most important because it tells us whether the difference between the means of the two conditions was large enough to not be a chance result. First, the table tells us the mean difference between conditions (this value is the difference between the mean scores of each condition: 20.01 – 18.49 = 1.52). The table also reports the standard deviation of the difference between the means and, more importantly, the standard error of the differences between participants’ scores in each condition. This standard error gives us some idea of the range of differences between sample means that we could expect by chance alone. The test statistic, t, is calculated by dividing the mean of differences by the standard error of differences (t = 1.52/0.56 = 2.71). The fact that the t-value is a positive number tells us that condition 1 (‘Women are from Bras, men are from Penis’) had a bigger mean than the second (Field & Hole) and so relationship happiness was less after reading our little book. The size of t is compared against known values based on the degrees of freedom. When the same participants have been used in both conditions, the degrees of freedom are simply the sample size minus 1 (df = N – 1 = 499). SPSS uses the degrees of freedom to calculate the exact probability that a value of t as big as the one obtained could occur by chance. This probability value of t is in the column labelled Sig., and this value (.007) is considerably smaller than Fisher’s criterion of .05. We could, therefore, say that there is a very significant (statistically speaking) difference between happiness scores after reading ‘Women are from Bras, men are from Penis’ compared to after reading our lovely book. We saw in the independent t-test that SPSS provides only the two-tailed probability, which is the probability when no prediction was made about the direction of group differences (see Box 5.3). 
In this case, Twaddle and Sons predicted that their book would make people happier and so they predicted that mean happiness after reading ‘Women are from Bras, men are from Penis’ would be higher than after reading our book. Looking at the table of descriptives the means are consistent with this prediction and because of this we can use the one-tailed probability value obtained by dividing the two-tailed probability by 2 (see Box 5.3). As such, this t-statistic is actually significant at p = .0035 (= .007/2).
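The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration that re-uses the values quoted from the SPSS output; the variable names are my own and SPSS does not expose anything by these names.

```python
# Values quoted in the text from SPSS Output 6.3.
mean_difference = 1.52   # mean of the paired differences
se_difference = 0.56     # standard error of the differences
n_participants = 500     # each participant read both books
p_two_tailed = 0.007     # the Sig. value reported by SPSS

# t is the mean difference divided by its standard error.
t_value = mean_difference / se_difference

# With the same participants in both conditions, df = N - 1.
df = n_participants - 1

# A directional prediction that the means support halves the two-tailed p.
p_one_tailed = p_two_tailed / 2

print(f"t({df}) = {t_value:.2f}, one-tailed p = {p_one_tailed:.4f}")
```

Halving the probability is only appropriate because the direction was predicted in advance and the means fall in that direction.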
Calculating the Effect Size for the Dependent t-Test
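We can use the same equation as for the independent t-test to convert our t-value into an r-value (see page 166). We know the value of t and the df from the SPSS output and so we can compute r as follows:

$$r = \sqrt{\frac{t^2}{t^2 + df}} = \sqrt{\frac{2.71^2}{2.71^2 + 499}} = \sqrt{\frac{7.34}{506.34}} = .12$$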
If you think back to our benchmarks for effect sizes this represents a small effect (it is just above .1 – the threshold for a small effect). Therefore, although this effect is highly statistically significant, the size of the effect is very small and so represents a trivial finding.
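As a rough check, the conversion from t to r described above can be sketched in Python. The sketch assumes the usual conversion r = sqrt(t² / (t² + df)); the function name `t_to_r` is hypothetical and chosen only for this illustration.

```python
import math


def t_to_r(t_value: float, df: int) -> float:
    """Convert a t-statistic to the effect size r = sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t_value**2 / (t_value**2 + df))


# t and df as reported for the dependent t-test in the text.
r = t_to_r(2.71, 499)
print(f"r = {r:.2f}")
```

With t = 2.71 and df = 499 this gives a value that rounds to .12, matching the effect size reported in the write-up.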
Interpreting and Writing the Result for the Dependent t-Test
In this example, it would be tempting for Twaddle and Sons to conclude that their book produced significantly greater relationship happiness than our book. In fact, many researchers would write conclusions like this:
The results show that reading ‘Women are from Bras, men are from Penis’ produces significantly greater relationship happiness than that book by smelly old Field & Hole. This result is highly significant.
However, to reach such a conclusion is to confuse statistical significance with the importance of the effect. By calculating the effect size we’ve discovered that although the difference in happiness after reading the two books is statistically very different, the size of effect that this represents is very small indeed. So, the effect is actually not very significant in real terms. A more correct interpretation might be to say:
The results show that reading ‘Women are from Bras, men are from Penis’ produces significantly greater relationship happiness than that book by smelly old Field & Hole. However, the effect size was small, revealing that this finding was not substantial in real terms.
Of course, this latter interpretation would be unpopular with Twaddle and Sons who would like to believe that their book had a huge effect on relationship happiness. This is possibly one reason why effect sizes have been ignored: because researchers want people to believe that their effects are substantive. In fact, the two examples we’ve used to illustrate the independent and dependent t-test are deliberately designed to illustrate an important point: it is possible to find a result that is only marginally significant (it has a relatively large p-value) yet is actually very substantial (it has a large effect size); conversely, it is possible to find a highly significant statistical result that is fairly unimportant (it has a small effect size). The independent t-test example had a p-value close to our criterion of .05, yet represented a large effect, whereas the dependent t-test example had a p-value much smaller than the criterion of .05 and yet represented a much smaller effect. The difference between these two situations was the sample size (20 and 500) on which the tests were based. For the dependent t-test the significance of the test came about because the sample size was so big that it had the power to detect even a very small effect. In the independent t-Test example, there were only 20 participants in total and so the test only had power to detect a very strong effect (had the effect size been smaller the test would have yielded a non-significant result).
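To see how much the sample size matters, the t-to-r conversion can be turned around to ask what t a given effect size would produce at different degrees of freedom. The sketch below assumes the conversion r = sqrt(t² / (t² + df)), rearranged to t = r × sqrt(df / (1 – r²)); the function name `t_from_r` is hypothetical, and the second scenario simply imagines the same small effect in a study of 20 participants.

```python
import math


def t_from_r(r: float, df: int) -> float:
    """Invert r = sqrt(t^2 / (t^2 + df)) to get the t implied by r and df."""
    return r * math.sqrt(df / (1 - r**2))


small_effect = 0.12  # the effect size reported for the dependent t-test

t_large_sample = t_from_r(small_effect, 499)  # 500 participants, df = N - 1
t_small_sample = t_from_r(small_effect, 18)   # 20 participants in two groups, df = N - 2

print(f"df = 499: t = {t_large_sample:.2f}")
print(f"df = 18:  t = {t_small_sample:.2f}")
```

The same effect that yields a t of about 2.7 with 499 degrees of freedom yields a t of only about 0.5 with 18, far short of the two-tailed critical value of roughly 2.10 for 18 degrees of freedom.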
In the previous example we saw that when we report a t-test we always include the degrees of freedom (in this case 499), the value of the t-statistic, and the level at which this value is significant. In addition we should report the effect size estimate:
On average, the reported relationship happiness after reading ‘Women are from Bras, men are from Penis’ (M = 20.02, SE = .45), was significantly higher than after reading Field & Hole’s book (M = 18.49, SE = .40), t(499) = 2.71, p < .01, r = .12.
On the other hand, if I’d displayed the means as a graph I could write:
All effects will be reported at a .05 level of significance. The reported relationship happiness after reading ‘Women are from Bras, men are from Penis’, was significantly higher than after reading Field & Hole’s tedious opus, t(499) = 2.71, r = .12. The effect size estimate indicates that the difference in relationship happiness created by the reading material was a small, and therefore unsubstantial, effect.
6.5 Analysis of Variance
* * *
The t-Test is limited to situations in which there are only two levels of the independent variable (e.g. two experimental groups), but often we run experiments in which there are three or more levels of the independent variable. The easiest way to look at such a situation would be to compare pairs of experimental conditions using lots of t-tests. However, every time we conduct a t-Test we do so assuming a 5% chance that we might accept an experimental effect that isn’t actually real (a Type I error). If we do lots of tests on the same data set these errors add up (this is known as inflating the Type I error rate), so that even if we do only 2 tests the error rate for that data set is greater than 5% (for more detail see Field, 2000, pp. 243–244). To ensure that the 5% Type I error rate is maintained for a given experiment, we have to use a test that looks for an overall experimental effect using a single test (such tests are sometimes called omnibus tests – which doesn’t mean they’re repeated on television on a Sunday afternoon!). Analysis of variance (or ANOVA) is such a test and as well as controlling the overall Type I error rate it also can be used to analyse situations in which there is more than one independent variable (we’ll look at these situations later in this section).
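How quickly these errors add up can be sketched numerically. The snippet below assumes the tests are independent of one another, in which case the chance of at least one Type I error across n tests is 1 – (1 – α)ⁿ; the function name is hypothetical, and with correlated tests the figures would only be approximate.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


# Show how the overall error rate grows with the number of tests at alpha = .05.
for n_tests in (1, 2, 3, 10):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:>2} tests: {rate:.3f}")
```

Under this assumption two tests already carry an overall error rate of 9.75%, and ten tests carry a rate of about 40%.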
When we perform a t-test, our null hypothesis is that the two samples have roughly the same mean. ANOVA extends this idea by testing the null hypothesis that three or more means are roughly equal.³ Like the t-Test, ANOVA produces a test statistic, the F-ratio, which compares the amount of systematic variance in the data (SSM) to the amount of unsystematic variance (SSR) – see Box 5.1. However, because ANOVA tests for an overall experimental effect there are things that it cannot tell us: although it tells us whether the experimental manipulation was generally successful, it does not provide specific information about which groups differ. Assuming an experiment was conducted with three different groups, the F-ratio simply tells us that the means of these three samples are not equal. However, there are several ways in which the means can differ: (1) all three sample means are significantly different; (2) the means of groups 1 and 2 are the same but group 3 has a significantly different mean from both of the other groups; (3) groups 2 and 3 have similar means but group 1 has a significantly different mean; and (4) groups 1 and 3 have similar means but group 2 has a significantly different mean from them both. So, the F-ratio tells us only that the experimental manipulation has had some effect, but it doesn’t tell us specifically what the effect was. To discover where the effect lies we have to follow up the ANOVA with some additional tests. There are two things we can use: planned comparisons or post hoc tests. Planned comparisons are used when you’ve made specific predictions about which group means should differ before you collected any data. How these contrasts are constructed is beyond the scope of this book, but I give a really detailed account in Chapter 7 of ‘Discovering statistics’ (Field, 2000, pp. 258–270). Post hoc tests are done after the data have been collected and inspected. 
These tests compare every experimental condition with every other condition, but are calculated in such a way that the overall Type I error rate is controlled at 5% despite the fact that lots of tests have been done. The simplest way to think of this is that it’s like doing lots of t-tests but being very strict about the cut-off point you use for accepting them as statistically significant. The simplest way to control the Type I error is to use what’s called a Bonferroni correction. This correction is made by simply dividing α, the probability of a Type I error (i.e. .05), by the number of tests you have done. So, if you have done two tests, the new α becomes .05/2 = .025, if you’ve done 5 tests then α = .05/5 = .01, and if you’ve done 10 tests then α = .05/10 = .005. Once you’ve done each test, you look at the probability of obtaining the test statistic by chance (just like you normally would) but only accept the result as significant if it is less than the corrected value of α. So, if we have done 2 tests, then we accept a result as significant not if it is less than .05, but only if it is less than .025. Although there are other corrections that can be used (see Field, 2000, Chapter 7), the Bonferroni correction is easily understood and, although slightly conservative, a good correction so it’s the one I’ll tend to use in this book.
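The Bonferroni rule is simple enough to sketch directly. In the snippet below the function names are hypothetical and the two p-values at the end are made up purely for illustration; they are not results from this book.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Corrected per-test criterion: alpha divided by the number of tests."""
    return alpha / n_tests


def significant_after_bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Flag each p-value as significant only if it is below the corrected alpha."""
    corrected = bonferroni_alpha(alpha, len(p_values))
    return [p < corrected for p in p_values]


# The three cases worked through in the text: 2, 5 and 10 tests.
for n_tests in (2, 5, 10):
    print(n_tests, round(bonferroni_alpha(0.05, n_tests), 4))

# Hypothetical p-values from two follow-up tests (not from the book).
print(significant_after_bonferroni([0.030, 0.010]))
```

With two tests the corrected criterion is .025, so a p-value of .030 that would have passed the usual .05 cut-off is no longer accepted, whereas .010 still is.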
6.6 One-Way Independent ANOVA
* * *
You should plan to use one-way independent ANOVA when you are going to test three or more experimental groups (you will compare 3 or more means), and different participants will be used in each group (so each person will contribute only one score to the data).
The name of a particular ANOVA gives away the situation in which it is used. Every ANOVA begins its name with some reference to the word ‘way’; this word can be read as ‘independent variable’ and the number that precedes it tells us how many independent variables were manipulated in the experiment. So, a one-way ANOVA will be used when one independent variable will be manipulated, a two-way ANOVA will be used when two independent variables will be manipulated, a three-way ANOVA will be used when three independent variables will be manipulated, and so on. The second half of the name tells us how these independent variables were measured. If it is an independent ANOVA then it means that different participants will take part in different conditions. If it is a repeated measures ANOVA then the same participants will take part in all experimental conditions. If it is a mixed ANOVA then it means that at least one independent variable will be measured using different participants and at least one independent variable will be measured using the same participants.
Example: Students (and lecturers for that matter) love their mobile phones, which is rather worrying given some recent controversy about links between mobile phone use and brain tumours. The basic idea is that mobile phones emit microwaves, and so holding one next to your brain for large parts of the day is a bit like sticking your brain in a microwave oven and selecting the ‘cook until well done’ button. If we wanted to test this experimentally, we could get 6 groups of people and strap a mobile phone on their heads (that they can’t remove). Then, by remote control, we turn the phones on for a certain amount of time each day. After 6 months, we measure the size of any tumour (in mm³) close to the site of the phone antenna (just behind the ear). The six groups experience 0, 1, 2, 3, 4 or 5 hours per day of phone microwaves for 6 months.
SPSS Output for One-Way Independent ANOVA
Figure 6.3 shows an error bar chart of the mobile phone data – this graph is not automatically produced by SPSS (I did it using the interactive graphs facility). The bars show the mean size of brain tumour in each condition, and the funny ‘I’ shapes show the confidence interval (CI) of these means. This is the range between which 95% of sample means would fall and tells us how well the mean represents the population (see page 135). It’s clear from this chart that there is a general trend that as mobile phone use increases so does the size of the brain tumour. Note that in the control group (0 hours), the mean size of the tumour is virtually zero (we wouldn’t actually expect them to have a tumour) and the error bar shows that there was very little variance across samples. We’ll see later that this is problematic for the analysis.
SPSS Output 6.4 shows the table of descriptive statistics from the one-way ANOVA; we’re told the means, standard deviations, and standard errors of the means for each experimental condition. The means should correspond to those plotted in Figure 6.3. Remember that the standard error is the standard deviation of the sampling distribution of these data (so, for the 2-hour group, if you took lots of samples from the population from which these data come, the means of these samples would have a standard deviation of 0.49). We are also given confidence intervals for the mean; these correspond to the error bars plotted in Figure 6.3 and tell us the limits within which the means of 95 out of 100 samples would fall (so for the 2-hour group if we took 100 samples, 95 of them would have a mean tumour size between 1.03 mm³ and 1.49 mm³). These diagnostics are important for interpretation later on.
Figure 6.3: Error bar chart of the group means for different hours per day of mobile phone use
SPSS Output 6.4
Like the t-Test, one of the assumptions of ANOVA is that the variances within experimental conditions are similar (homogeneity of variance). The next part of the output reports a test of this assumption, Levene’s test, which tests the null hypothesis that the variances of the groups are the same. If Levene’s test is significant (i.e. the value of Sig. is less than .05) then the variances are significantly different meaning that one assumption of the ANOVA has been violated. For these data, this is the case, because our significance is .000,⁴ which is considerably smaller than the criterion of .05. In these situations, we have to try to correct the problem and the most common way is to transform all of the data (see Howell, 2001, pp. 342–349). Be warned though, in my experience transformations are often about as helpful as the average London shop-assistant (i.e. not very). If the transformation doesn’t work then you should either use a non-parametric test (see Chapter 7) or you might have to just report an inaccurate F value and the Levene’s test (so that others can assess the accuracy of your results for themselves). For these data the problem has almost certainly arisen because in the control condition (0 hours) the variance is very small indeed – much smaller than all other conditions (most people had no tumour at all and so all scores would have been close to zero). Transforming these data has very little effect (try it if you don’t believe me) so bear this in mind when I discuss how to report these results.
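To get a feel for why a near-constant control group upsets Levene’s test, it helps to sketch what the test does. In its classic form it amounts to a one-way ANOVA on the absolute deviations of each score from its own group mean. The snippet below is a minimal illustration under that assumption: the function names are hypothetical, the three groups of scores are invented for the example (they are not the data behind SPSS Output 6.4), and no significance value is computed because Python’s standard library has no F distribution.

```python
from statistics import mean


def one_way_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way independent ANOVA."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Systematic variation: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Unsystematic variation: how far each score lies from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


def levene_statistic(groups: list[list[float]]) -> tuple[float, int, int]:
    """Levene's W: a one-way ANOVA on absolute deviations from each group mean."""
    deviations = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_f(deviations)


# Invented tumour sizes: the control group barely varies, the others do.
control = [0.00, 0.01, 0.02, 0.01, 0.00]
two_hours = [0.8, 1.9, 1.2, 0.6, 1.7]
five_hours = [3.9, 6.1, 4.4, 5.8, 4.9]

w, df_between, df_within = levene_statistic([control, two_hours, five_hours])
print(f"Levene W({df_between}, {df_within}) = {w:.2f}")
```

Because the control scores sit almost on top of their mean, their absolute deviations are tiny compared with those of the other groups, which is what drives the statistic up.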