How to Design and Report Experiments
SPSS Output 6.6 shows the main ANOVA summary table. The table is divided into between-group effects (effects due to the model – the experimental effect, or SSM as I usually call it) and within-group effects (this is the unsystematic variation in the data, or SSR). The between-group effect is the overall experimental effect, and in this row of the table we are told the sums of squares for the model (the total experimental effect, SSM = 450.66) and the degrees of freedom on which these sums of squares are based (dfM = 5). The sum of squares for the experimental effect is a total and, as such, depends on the number of scores that have been added up (reflected by the degrees of freedom). Now, the number of scores used to calculate the various sums of squares differs for the experimental effect and the unsystematic variance, so to make them comparable we actually use the average sums of squares – the mean squares, MS. To get the mean sum of squares, we divide the sum of squares by its associated degrees of freedom (MSM = 450.66/5 = 90.13). This value is the experimental effect, the systematic variance. The row labelled Within groups gives details of the unsystematic variation within the data (the variation due to natural individual differences in brain tumour size and different reactions to the microwaves emitted by the phones). Again, this can be expressed as a sum of squared errors (SSR = 38.09) with an associated degrees of freedom (dfR = 114), or as a mean squared error calculated by dividing the sum of squares by its degrees of freedom (MSR = 38.09/114 = 0.334). The test of whether the group means are the same is represented by the F-ratio, which is simply the ratio of systematic variance to unsystematic variance, or put another way, the size of experimental effect compared to the size of error. Therefore, F is simply MSM/MSR, which in this case is 90.13/0.33 = 269.73. 
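As a quick check, the summary-table arithmetic just described can be sketched in a few lines. The numbers are the ones reported in SPSS Output 6.6; the tiny discrepancy in F (SPSS reports 269.73, computed from unrounded values) comes from rounding in the printed sums of squares.

```python
# Reproducing the ANOVA summary-table calculations from SPSS Output 6.6.

ss_model = 450.66    # SSM: between-group (experimental) sum of squares
df_model = 5         # dfM: number of groups minus 1
ss_resid = 38.09     # SSR: within-group (error) sum of squares
df_resid = 114       # dfR: total N minus number of groups

ms_model = ss_model / df_model   # MSM = 90.13
ms_resid = ss_resid / df_resid   # MSR = 0.334
f_ratio = ms_model / ms_resid    # systematic / unsystematic variance

print(round(ms_model, 2), round(ms_resid, 3), round(f_ratio, 1))  # 90.13 0.334 269.8
```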
Finally, SPSS tells us the probability of getting an F-ratio of this magnitude by chance alone; in this case the probability is .000 (that’s 0 to 3 decimal places). As we know already, social scientists use a cut-off point of .05 as their criterion for statistical significance. Hence, because the observed significance value is less than .05 we can say that there was a significant effect of mobile phones on the size of tumour. However, at this stage we still do not know exactly what the effect of the phones was (we don’t know which groups differed).
Having established that the ANOVA is significant, we can now look at some post hoc tests. SPSS provides you with 18 options for different post hoc tests that vary in their accuracy, their power and their ability to control the Type I error rate. I summarized these procedures in ‘Discovering statistics’ (Field, 2000) and to cut a long story short I suggested the following:
When you have equal sample sizes and you are confident that your homogeneity of variances assumption has been met then use REGWQ or Tukey HSD as both have good power and tight control over the Type I error rate.
Bonferroni is generally conservative (it is strict about whether it accepts a test statistic as being significant), but if you want guaranteed control over the Type I error rate then it is the test to use.
If sample sizes across groups are slightly different then use Gabriel’s procedure because it has greater power, but if sample sizes are very different use Hochberg’s GT2.
If there is any doubt that the population variances are equal then use the Games-Howell procedure because this generally seems to offer the best performance. I recommend running the Games-Howell procedure in addition to any other tests you might select because of the uncertainty of knowing whether the population variances are equivalent.
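The guidelines above amount to a small decision procedure, which can be sketched as a hypothetical helper function. The 1.5 ratio used to separate "slightly" from "very" different sample sizes is my own illustrative assumption; the text itself gives no numerical cut-off.

```python
def choose_post_hoc(equal_variances, sample_sizes):
    """Suggest post hoc tests following the guidelines summarized above."""
    tests = ["Games-Howell"]   # recommended in addition to any other test
    if not equal_variances:
        return tests           # doubtful variances: rely on Games-Howell
    n_min, n_max = min(sample_sizes), max(sample_sizes)
    if n_min == n_max:
        tests += ["REGWQ", "Tukey HSD"]   # equal n: good power, tight Type I control
    elif n_max / n_min <= 1.5:            # "slightly different" n (assumed cut-off)
        tests.append("Gabriel")
    else:                                 # "very different" sample sizes
        tests.append("Hochberg GT2")
    return tests

print(choose_post_hoc(True, [20, 20, 20]))  # ['Games-Howell', 'REGWQ', 'Tukey HSD']
```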
SPSS Output 6.7 shows the output from the Games-Howell post hoc procedure. It is clear from the table that each group of participants is compared to all of the remaining groups. For each pair of groups the difference between group means is displayed, the standard error of that difference, the significance level of that difference and a 95% confidence interval. First, the control group (0 hours) is compared to the 1-hour, 2-hour, 3-hour, 4-hour and 5-hour groups and reveals a significant difference in all cases (all the values in the column labelled Sig. are less than .05). In the next part of the table, the 1-hour group is compared to all other groups. Again all comparisons are significant (all the values in the column labelled Sig. are less than .05). In fact, all of the comparisons appear to be highly significant except the comparison between the 4-hour and 5-hour groups, which is non-significant because the value in the column labelled Sig. is bigger than .05.
SPSS Output 6.7
SPSS Output 6.8 shows a different type of post hoc test, the REGWQ. I’ve included this because the table is different. With the REGWQ and some other tests (Hochberg’s GT2 for example) a table is produced in which group means are placed in the same column only if they are not significantly different using a criterion probability of .05. In this example the only two means that appear in the same column are those of the groups that spent 4 and 5 hours on the phone per day (see the last column). This tells us that all group means were significantly different from each other except for the groups that spent 4 and 5 hours on the phone.
SPSS Output 6.8
Does it matter which test you use? Well, actually it does. In this example you’d find that if you used a Bonferroni test then the group that spent 1 hour on the phone per day is not significantly different from the group that spent no hours on the phone per day (as practice, why not try it in SPSS? The data files can be downloaded from my website). In reality these groups do appear to be different, but we only detected this because we used a post hoc test that wasn’t affected by the fact that our groups had different variances. Bonferroni is not only conservative, but is also affected by differences in group variances.
Calculating the Effect Size for One-Way Independent ANOVA
SPSS Output 6.6 provides us with three measures of variance: the between-group effect (SSM), the within-group effect (SSR) and the total amount of variance in the data (SST). We know from Box 5.1 that we can calculate r² using these values (although for some bizarre reason it’s usually called eta-squared, η²). It is then a simple matter to take the square root of this value to give us the effect size r.
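The equation this passage walks through appears to have been lost in extraction. Reconstructing the calculation from the values in SPSS Output 6.6 (where SST = SSM + SSR = 450.66 + 38.09 = 488.75):

```latex
\eta^2 = \frac{SS_M}{SS_T} = \frac{450.66}{488.75} = .92, \qquad
r = \sqrt{\eta^2} = \sqrt{.92} = .96
```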
Using the benchmarks for effect sizes this represents a massive effect (it is well above .5 – the threshold for a large effect). Therefore, the effect of mobile phone use on brain tumours is a substantive finding.
This measure of effect size is actually slightly biased because we’re interested in the effect size in the population and this value is based purely on sums of squares from the sample (with no adjustment made for the fact that we’re trying to estimate a population value). To reduce this bias, we can use a slightly more complex measure of effect size known as omega squared (ω²). This effect size estimate is still based on the sums of squares that we met in Box 5.1, but like the F-ratio it uses the variance explained by the model, and the error variance (in both cases the average variance, or mean squared error, is used):
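The equation itself appears to be missing here. One common form of omega-squared consistent with the surrounding description is given below (k is the number of groups, n the number of people per group; treat the exact form as an assumption, since the original equation is not shown):

```latex
\omega^2 = \frac{df_M\,(MS_M - MS_R)}{df_M\,(MS_M - MS_R) + nk\,MS_R}
```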
The n in the equation is simply the number of people in a group (in this case 20). So, in this example we’d get (the values are taken from SPSS Output 6.6):
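The worked example that followed here seems to have been lost in extraction. Substituting the values from SPSS Output 6.6 into a standard omega-squared formula (a reconstruction, not the original figure) gives:

```latex
\omega^2 = \frac{5(90.13 - 0.33)}{5(90.13 - 0.33) + (20)(6)(0.33)}
         = \frac{449.0}{488.6} = .92, \qquad
\omega = \sqrt{.92} = .96
```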
As you can see this has led to an identical estimate to using r, but this will not always be the case because ω is generally a more accurate measure. For the remaining sections on ANOVA I will use omega as my effect size measure, but think of it as you would r (because it’s basically an unbiased estimate of r anyway).
Interpreting and Writing the Result for One-Way Independent ANOVA
When we report an ANOVA, we have to give details of the F-ratio and the degrees of freedom from which it was calculated. For the experimental effect in these data the F-ratio was derived from dividing the mean squares for the effect by the mean squares for the residual. Therefore, the degrees of freedom used to assess the F-ratio are the degrees of freedom for the effect of the model (dfM = 5) and the degrees of freedom for the residuals of the model (dfR = 114). As with the t-test it is also customary to either begin by stating a general level of significance that you’re using, or to report the significance of each effect against standard benchmarks such as .05, .01, and .001. Therefore, we could report the main finding as:
The results show that using a mobile phone significantly affected the size of brain tumour found in participants, F(5, 114) = 269.73, p < .001.
Why is this incorrect? Well, in this example, we need to report that the homogeneity of variance assumption was broken as well as our main result. It would be incorrect not to mention this fact. Levene’s test is actually just an ANOVA (see Field, 2000, p. 284) and so we can report it using the letter F. A more thorough interpretation might be to say:
Levene’s test indicated that the assumption of homogeneity of variance had been violated, F(5, 114) = 10.25, p < .001. Transforming the data did not rectify this problem and so F-tests are reported nevertheless. The results show that using a mobile phone significantly affected the size of brain tumour found in participants, F(5, 114) = 269.73, p < .001.
Of course we should really report the effect size as well:
Levene’s test indicated that the assumption of homogeneity of variance had been violated, F(5, 114) = 10.25, p < .001. Transforming the data did not rectify this problem and so F-tests are reported nevertheless. The results show that using a mobile phone significantly affected the size of brain tumour found in participants, F(5, 114) = 269.73, p < .001, r = .96. The effect size indicated that the effect of phone use on tumour size was substantial.
The next things that need to be reported are the post hoc comparisons. It is customary just to summarize these tests in very general terms, like this:
Games-Howell post hoc tests revealed significant differences between 0 and 1 hour (p < .001), 0 and 2 hours (p < .001), 0 and 3 hours (p < .001), 0 and 4 hours (p < .001), 0 and 5 hours (p < .001), 1 and 2 hours (p < .001), 1 and 3 hours (p < .001), 1 and 4 hours (p < .001), 1 and 5 hours (p < .001), 2 and 3 hours (p < .001), 2 and 4 hours (p < .001), 2 and 5 hours (p < .001), 3 and 4 hours (p < .001), 3 and 5 hours (p < .001), but not between 4 and 5 hours (ns).
This is rather cumbersome, and the same information could be provided in a more succinct way:
Games-Howell post hoc tests revealed significant differences between all groups (p < .001 for all tests) except between 4 and 5 hours (ns).
If you do want to report the results for each post hoc test individually, then at least include the 95% confidence intervals for the test as these tell us more than just the significance value. In examples such as this when there are many tests it might be as well to summarize these confidence intervals as a table:
Table 6.1
6.7 One-Way Repeated Measures ANOVA
If you plan to have three or more experimental groups (you will compare three or more means), and the same participants will be used in each group (so each person contributes several scores to the data) then you will use one-way repeated measures ANOVA to analyse the data.
Sphericity
In independent ANOVA the accuracy of the F-test depends upon the assumption that the groups tested are independent. As such, the relationship between treatments in a repeated measures design causes the conventional F-test to lack accuracy. The relationship between scores in different conditions means that we have to make an additional assumption to those in between-group ANOVA. Put simplistically, we assume that the relationship between one pair of conditions is similar to the relationship between a different pair of conditions. This assumption is known as the assumption of sphericity. Sphericity refers to the equality of variances of the differences between treatment levels. So, if you were to take each pair of treatment levels, and calculate the differences between each pair of scores, then it is necessary that these differences have equal variances (see Field, 1998a, for an example).
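What sphericity requires can be made concrete with a small sketch: for every pair of treatment levels, compute the difference score for each participant and then the variance of those differences. The data below are made up purely for illustration (rows are participants, columns are treatment levels).

```python
# Illustrative sphericity check: variance of the pairwise difference
# scores between treatment levels. Sphericity holds (roughly) when
# these variances are all similar.
from itertools import combinations
from statistics import variance

def difference_variances(scores):
    """Variance of (level a - level b) difference scores for each pair of levels."""
    k = len(scores[0])                      # number of treatment levels
    result = {}
    for a, b in combinations(range(k), 2):
        diffs = [row[a] - row[b] for row in scores]
        result[(a, b)] = variance(diffs)    # sample variance of the differences
    return result

# Hypothetical data: 4 participants, 3 treatment levels.
data = [[8, 7, 5], [6, 6, 3], [9, 5, 4], [7, 6, 6]]
print(difference_variances(data))
```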
SPSS produces a test known as Mauchly’s test, which tests the hypothesis that the variances of the differences between conditions are equal. Therefore, if Mauchly’s test statistic is significant (i.e. has a probability value less than .05) we conclude that some pairs of conditions are more related than others and the condition of sphericity has not been met. If, however, Mauchly’s test statistic is nonsignificant (i.e. p > .05) then it is reasonable to conclude that the relationships between pairs of conditions are roughly equal. In short, if Mauchly’s test is significant then be wary of the F-ratios produced by the computer!
The effect of violating sphericity is a loss of power (i.e. increased probability of a Type II error) because the test statistic (F-ratio) simply cannot be assumed to have an F-distribution (for more details see Field, 1998a, 2000). However, as we’ll see later there are some things we can do to correct the problem.
Example: Imagine I wanted to look at the effect alcohol has on the roving eye. The ‘roving eye’ effect is the propensity of people in relationships to ‘eye-up’ members of the opposite sex. I took 20 men and fitted them with incredibly sophisticated glasses that could track their eye movements and record both the movement and the object being observed (this is the point at which it should be apparent that I’m making it up as I go along). Over 4 different nights I plied these poor souls with either 1, 2, 3 or 4 pints of strong lager in a nightclub. Each night I measured how many different women they eyed-up (a woman was categorized as having been eyed-up if the man’s eye moved from her head to toe and back up again). To validate this measure we also measured the amount of dribble on the man’s chin while he was looking at a woman.
SPSS Output for One-Way Repeated Measures ANOVA
Figure 6.4 shows an error bar chart of the roving eye data. The bars show the mean number of women that were eyed-up after different doses of alcohol. As in the previous section, I have included error bars representing the confidence interval of these means (see pages 135 and 176 for an explanation). It’s clear from this chart that the mean number of women is pretty similar between 1 and 2 pints, and for 3 and 4 pints, but there is a jump after 2 pints.
Figure 6.4 Error bar chart of the mean number of women eyed-up after different doses of alcohol
SPSS Output 6.9 shows the initial diagnostics statistics. First, we are told the variables that represent each level of the independent variable. This box is useful to check that the variables were entered in the correct order. The next table provides basic descriptive statistics for the four levels of the independent variable. This table confirms what we saw in the graph; that is, that the mean number of women eyed-up after 1 and 2 pints is similar, and the means after 3 and 4 pints are similar also, but there is an increase between 2 and 3 pints. These mean values are useful for interpreting any effects that may emerge from the main analysis.
SPSS Output 6.9
Earlier I mentioned that SPSS produces a test of whether the data violate the assumption of sphericity. The next part of the output contains Mauchly’s test and we hope to find that it’s non-significant if we are to assume that the condition of sphericity has been met. SPSS Output 6.10 shows Mauchly’s test and the important column is the one containing the significance value. The significance value (.022) is less than the critical value of .05, so we accept that the variances of the differences between levels are significantly different. In other words, the assumption of sphericity has been violated. Obviously we were hoping not to have violated this assumption, but seeing as we have (this always seems to happen when I make up data!) what do we do?
If data violate the sphericity assumption there are three corrections that can be applied to produce a valid F-ratio: corrections based on (1) the Greenhouse-Geisser (1959) estimate of sphericity; (2) the Huynh-Feldt (1976) estimate of sphericity; and (3) the lowest possible estimate of sphericity (the lower-bound). All of these estimates give rise to a correction factor that is applied to the degrees of freedom used to assess the observed F-ratio. In all cases they lower the degrees of freedom, which in real terms means that your F-ratio has to be bigger to achieve significance. We don’t need to go into the tedious detail of how these corrections are calculated (but see Girden (1992) if you’re having sleepless nights about it); we need only know that the three estimates produce different correction factors. With all three estimates, the closer they are to 1.00 the more homogeneous the variances of differences, and hence the closer the data are to being spherical. There has been some debate about which correction factor is best (see Field, 1998a, 2000) and, to summarize, Girden (1992) recommends that when estimates of sphericity are greater than .75 the Huynh-Feldt correction should be used, but when sphericity estimates are less than .75, or nothing is known about sphericity at all, the Greenhouse-Geisser correction should be used instead. If we look at the estimates of sphericity (in SPSS Output 6.10, in the column labelled Epsilon) the Greenhouse-Geisser estimate is .75 (rounded), so by Girden’s recommendation we should use the Huynh-Feldt correction (see later). Had this estimate been less than .75, the Greenhouse-Geisser correction should have been used.
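How the epsilon estimate adjusts the degrees of freedom, and Girden's rule for choosing a correction, can be sketched as follows (the function names are hypothetical, for illustration only; both df terms are simply multiplied by epsilon):

```python
def corrected_df(epsilon, k, n):
    """Scale both repeated measures df terms by the sphericity estimate epsilon."""
    return epsilon * (k - 1), epsilon * (k - 1) * (n - 1)

def choose_correction(gg_epsilon):
    """Girden (1992): Greenhouse-Geisser below .75, Huynh-Feldt otherwise."""
    return "Huynh-Feldt" if gg_epsilon >= 0.75 else "Greenhouse-Geisser"

# With k = 4 conditions and n = 20 participants, the uncorrected df of
# (3, 57) shrink when epsilon = .75:
print(choose_correction(0.75))    # Huynh-Feldt
print(corrected_df(0.75, 4, 20))  # (2.25, 42.75)
```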
SPSS Output 6.10
SPSS Output 6.11 shows the main result of the ANOVA. This table is essentially the same as the one for one-way independent ANOVA (see SPSS Output 6.6). There is a sum of squares for the main effect of alcohol, which tells us how much of the total variability is explained by the experimental effect. There is also an error term, which is the amount of unexplained variation across the conditions of the repeated measures variable. These sums of squares are converted into mean squares in the same way as for the independent ANOVA: by dividing by the degrees of freedom. The df for the effect of alcohol is simply the number of levels of the independent variable minus 1 (k – 1) and the error df is this value multiplied by one less than the number of participants: (n – 1)(k – 1). The F-ratio is obtained by dividing the mean squares for the experimental effect (75.03) by the error mean squares (15.87). As with independent ANOVA, this test statistic represents the ratio of systematic variance to unsystematic variance. The value of F (4.73) is then compared against a critical value for 3 and 57 degrees of freedom. SPSS displays the exact significance level for the F-ratio. The significance of F is .005, which is significant because it is less than the criterion value of .05. We can, therefore, conclude that alcohol had a significant effect on the average number of women that were eyed-up. However, this main test does not tell us which quantities of alcohol made a difference to the number of women eyed-up.
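The degrees of freedom and the F-ratio described above can be checked with a couple of lines, using the mean squares reported in SPSS Output 6.11:

```python
# Repeated measures df and F-ratio for the roving eye data.
k, n = 4, 20                    # 4 doses of alcohol, 20 participants

df_effect = k - 1               # df for the effect of alcohol
df_error = (n - 1) * (k - 1)    # error df

ms_effect = 75.03               # mean squares for the experimental effect
ms_error = 15.87                # error mean squares
f_ratio = ms_effect / ms_error  # systematic / unsystematic variance

print(df_effect, df_error, round(f_ratio, 2))  # 3 57 4.73
```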