How to Design and Report Experiments
Box 6.2: Interpreting Interaction Graphs (Line Charts)
In Box 6.1 we had a look at how interactions could be represented using bar charts. In this section we’ve represented interactions using line charts. Actually, both are acceptable (although it’s more common for people to use line charts for interactions in textbooks, bar charts are used very frequently in published research). In the data for how text messaging affected grammar ability over time, we found a significant interaction and concluded (from Figure 6.8) that the decline in grammar ability due to text messaging was significantly greater than the change when no text messages were allowed. Would you conclude the same from the graph below?
This profile of data would almost certainly not produce a significant interaction term because the change in grammar ability due to text messaging is now the same as the change caused by not sending text messages. The lines are parallel, indicating that the change in grammar ability is the same in both the text messagers and controls. So, we usually know just from looking at an interaction graph whether there is likely to be an interaction effect: if the lines are parallel (or nearly parallel) there won’t be a significant interaction effect, but if the lines look non-parallel then there is a chance that there might be a significant interaction.
Let’s try another example. Is there an interaction now?
For these data there is an even stronger interaction than originally shown (see Figure 6.8). This is shown by the fact that the lines cross over. In fact, the control group shows no change in grammar ability (the line connecting the means from before and after the experiment is horizontal), but for the text messagers there is a decline in grammar ability. Crossed lines like these are non-parallel and, therefore, indicate the possibility of a significant interaction (although contrary to what some textbooks might have you believe it isn’t always the case that if the lines of the interaction graph cross then the interaction is significant)!
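One way to make the 'parallel lines' idea concrete is to treat the interaction as a difference of differences: the change in one group minus the change in the other. The sketch below uses invented cell means (they are not the values behind Figure 6.8 or the graphs in this box), and the function name is just for illustration.

```python
# Hypothetical (before, after) cell means for each group; illustrative only.
def interaction_contrast(text_means, control_means):
    """Change in the text message group minus change in the control group."""
    text_change = text_means[1] - text_means[0]
    control_change = control_means[1] - control_means[0]
    return text_change - control_change

# Parallel lines: both groups drop by 10 points, so the contrast is zero.
parallel = interaction_contrast((70, 60), (65, 55))

# Crossing lines: controls stay flat while the text messagers decline.
crossing = interaction_contrast((70, 50), (60, 60))

print(parallel)  # 0
print(crossing)  # -20
```

A contrast of zero corresponds to parallel lines. A non-zero contrast only tells us the lines are not parallel; whether the interaction is significant still depends on how large that contrast is relative to the error variance.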
Calculating the Effect Size for Two-Way Mixed ANOVA
SPSS Output 6.19 provides us with the two things we need to calculate each effect size: the mean squares for the effect (MSM) and the mean squares for the error (MSR). The exact value of the experimental effect (MSM) depends on which effect we’re looking at, and the unsystematic variance (MSR) is different for the repeated measures effects and the between-group effect (note in SPSS Output 6.19 that there is an error term in each table). We can calculate the effect size for each of the effects in turn using the equation on page 181.
For the effect of time, SPSS Output 6.19 tells us that the mean squares for the effect is 1528.81, and the mean squares for the error is 98.91, hence:
Using the benchmarks for effect sizes this represents a large effect (it is above the threshold of .5). Therefore, the change in grammar ratings over time is a substantive finding.
We know already that the effect of group was non-significant but the effect size estimate indicates that there was actually a medium effect to be detected (you might remember that the significance value was close to .05). It’s worth considering the possibility that we didn’t detect this effect because our sample was relatively small.
Interestingly, the effect size for the interaction is not much larger than the effect size for the non-significant effect of group. This illustrates how statistical significance can be misleading. Nevertheless, the interaction between text messaging and the point in time when grammar was measured was a relatively strong effect.
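The equation on page 181 is not reproduced in this section, so the following is only a sketch under stated assumptions. It assumes the effect size has the form sqrt((MSM − MSR)/(MSM + (n − 1) × MSR)), which can be rewritten in terms of F as sqrt((F − 1)/(F + n − 1)), and it assumes n = 25 participants per group (inferred from the 48 residual degrees of freedom and the t(24) tests reported in this chapter). Under those assumptions the calculation reproduces the three effect sizes reported for these data.

```python
import math

def effect_size_r(ms_model, ms_error, n):
    """Assumed form: sqrt((MSM - MSR) / (MSM + (n - 1) * MSR))."""
    return math.sqrt((ms_model - ms_error) / (ms_model + (n - 1) * ms_error))

def effect_size_from_f(f_ratio, n):
    """The same quantity written in terms of F = MSM / MSR."""
    return math.sqrt((f_ratio - 1) / (f_ratio + n - 1))

N_PER_GROUP = 25  # assumption: 50 participants split equally into two groups

r_time = effect_size_r(1528.81, 98.91, N_PER_GROUP)    # mean squares from the text
r_group = effect_size_from_f(2.99, N_PER_GROUP)        # F for the group effect
r_interaction = effect_size_from_f(4.17, N_PER_GROUP)  # F for the interaction

print(round(r_time, 2), round(r_group, 2), round(r_interaction, 2))  # 0.61 0.27 0.34
```

The F-based version is used for group and interaction only because their mean squares are not quoted in the text; the two functions are algebraically equivalent.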
Writing the Result for Two-Way Mixed ANOVA
Hopefully by now you know that for any effect in ANOVA we have to report the details of the F-ratio and the degrees of freedom from which it was calculated. The three effects we have here all have different F-ratios based on different degrees of freedom. For the main effects of time and group, and the time × group interaction, the model degrees of freedom were 1 (dfM = 1). The degrees of freedom for the residuals were 48 (dfR = 48) for both the repeated measures error term and the independent error term. We can report the three effects from this analysis as follows:
The results show that the grammar ratings at the end of the experiment were significantly lower than those at the beginning of the experiment, F(1, 48) = 15.46, p < .001, r = .61.
The main effect of group on the grammar scores was non-significant, F(1, 48) = 2.99, ns, r = .27. This indicated that when the time at which grammar was measured is ignored, the grammar ability in the text message group was not significantly different to the controls.
The time × group interaction was significant, F(1, 48) = 4.17, p < .05, r = .34, indicating that the change in grammar ability in the text message group was significantly different to the change in the control group. Specifically, there was a significant drop in grammar ability in the text message group, t(24) = 3.38, p < .01, r = .57, but a much weaker drop in ability in the control group, t(24) = 2.01, ns, r = .38. These findings indicate that although there was a medium effect size in the natural decay of grammatical ability over time (as shown by the controls) there was a much stronger effect when participants were encouraged to use text messages. This shows that using text messages accelerates the inevitable decline in grammatical ability.
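The r values attached to the two t-tests above are consistent with the standard conversion r = sqrt(t² / (t² + df)). The text does not say which formula was used, so treat the following as a sketch that assumes this conversion; the function name is mine.

```python
import math

def r_from_t(t, df):
    """Convert a t-statistic and its degrees of freedom to an effect size r."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

r_text = r_from_t(3.38, 24)     # text message group
r_control = r_from_t(2.01, 24)  # control group

print(round(r_text, 2))     # 0.57
print(round(r_control, 2))  # 0.38
```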
6.10 Two-Way Repeated Measures ANOVA
* * *
The final ANOVA that we have to consider is when we want to measure everything using the same participants. So, we still have two independent variables but now both of them will be measured using the same participants. If we want to use a two-way repeated measures ANOVA then the ‘two-way’ part tells us that we have to manipulate two independent variables, and the ‘repeated measures’ part tells us that both of these should be manipulated using the same participants.
Example: In my wonderful statistics textbook (which is wonderful only because it is so big that it is handy for propping up wonky tables, is useful for weight training, can be used to knock out attackers in dark alleyways, and is a superb cure for insomnia) I use an example of the beer-goggles effect (Field, 2000, p. 310). This effect is known to us all and can be summarized as a severe perceptual distortion after imbibing vast quantities of alcohol. The specific visual distortion is that previously unattractive people suddenly become the hottest thing since Spicy Gonzalez’ extra hot Tabasco-marinated chillies. In short, one minute you’re standing in a zoo admiring the orangutans, and the next you’re wondering why someone would put Gail Porter (or whatever her surname is now) into a cage. Anyway, Field (2000) in a blatantly fabricated data set demonstrated that the beer-goggles effect was much stronger for men than women, and took effect only after two pints. Imagine we wanted to follow up this finding to look at what factors mediate the beer-goggles effect. Specifically, we thought that the beer-goggles effect might be made worse by the fact that it usually occurs in clubs, which have dim lighting. We took a sample of 26 men (because the effect is stronger in men) and gave them various doses of alcohol over four different weeks (0 pints, 2 pints, 4 pints and 6 pints of lager). This is our first independent variable, which we’ll call alcohol consumption, and it has four levels. Each week (and, therefore, in each state of drunkenness) participants were asked to select a mate in a normal club (that had dim lighting) and then select a second mate in a specially designed club that had bright lighting. As such, the second independent variable was whether the club had dim or bright lighting. The outcome measure was the attractiveness of each mate as assessed by a panel of independent judges.
To recap, all participants took part in all levels of the alcohol consumption variable, and selected mates in both brightly- and dimly-lit clubs.
SPSS Output for Two-Way Repeated Measures ANOVA
Figure 6.11 shows a line chart displaying the mean attractiveness of the partner selected (with error bars) in dim and brightly lit clubs after the different doses of alcohol. The chart shows that in both dim and brightly lit clubs there is a tendency for men to select less attractive mates as they consume more and more alcohol.
SPSS Output 6.21 shows the means for all conditions in a table. These means correspond to those plotted in Figure 6.11.
When we looked at one-way repeated measures ANOVA we came across the assumption of sphericity (see page 183). This assumption also has to be checked when we have two repeated measures variables, the only complication being that we have to check it for every effect (including the interaction). SPSS Output 6.22 shows Mauchly’s sphericity test (see page 184) and, in fact, a test is produced for each independent variable, and also the interaction between them. The variable lighting had only two levels (dim or bright) and so the assumption of sphericity doesn’t apply (see page 203) and SPSS doesn’t produce a significance value. However, for the effects of alcohol consumption and the interaction of alcohol consumption and lighting, we do have to look at Mauchly’s test. The significance values are both above .05 (they are .454 and .768, respectively) and so we know that the assumption of sphericity has been met for both alcohol consumption, and the interaction of alcohol consumption and lighting.
Figure 6.11 Line chart (with error bars showing the standard error of the mean – see page 202) of the mean attractiveness of the selected mate after different doses of alcohol in dim and brightly lit clubs
SPSS Output 6.21
SPSS Output 6.23 shows the main ANOVA summary table, and because both independent variables are measured in the same way (in this case both are repeated measures) all of the results appear in a single table. Like the other two-way ANOVAs we have encountered we have three effects to look at: two main effects (one for each independent variable) and one interaction term. The main difference to the other two-way ANOVAs we have looked at is that each of these effects has its own error term.
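Because each effect has its own error term, each also has its own pair of degrees of freedom. As a rough sketch (assuming sphericity holds, so no correction is applied, and using a helper name of my own), the degrees of freedom for a fully repeated measures design with n participants can be worked out like this:

```python
def repeated_measures_dfs(n, levels_a, levels_b):
    """(model df, error df) for each effect in an A x B repeated measures design."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_ab = df_a * df_b
    return {
        "A": (df_a, df_a * (n - 1)),
        "B": (df_b, df_b * (n - 1)),
        "A x B": (df_ab, df_ab * (n - 1)),
    }

# 26 men, alcohol (A) with four levels, lighting (B) with two levels.
dfs = repeated_measures_dfs(n=26, levels_a=4, levels_b=2)
print(dfs)  # {'A': (3, 75), 'B': (1, 25), 'A x B': (3, 75)}
```

For the beer-goggles design this gives 3 and 75 for alcohol, 1 and 25 for lighting, and 3 and 75 for the interaction.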
SPSS Output 6.22
SPSS Output 6.23
The main effect of lighting is shown by the F-ratio in the row labelled lighting. Lighting explains 1993.92 (SSM(Lighting)) units of variance compared with 2128.33 units explained by the unsystematic variation for lighting (SSR(Lighting)). These values are converted into average effects, or mean squares, by dividing by the degrees of freedom, which are 1 for the effect of lighting, and 25 for the unexplained variation. The resulting F-ratio is the mean square for lighting divided by the mean square error (1993.92/85.13 = 23.42). The significance of this value is reported as .000 (i.e. p < .001), which is well below the usual cut-off point of .05. We can conclude that average attractiveness ratings were significantly affected by whether mates were selected in a dim or well-lit club. We can easily interpret this result further because there were only two levels. Figure 6.12 shows the mean attractiveness of mates in dim and well-lit clubs when the amount of alcohol drunk is ignored. Attractiveness ratings were higher in the well-lit clubs, so we could conclude that when we ignore how much alcohol was consumed, the mates selected in well-lit clubs were significantly more attractive than those chosen in dim clubs.
Figure 6.12 Mean attractiveness of selected mates in dim and brightly lit clubs when you ignore how much alcohol the participant had consumed. Error bars show the standard error of the mean (see page 134)
The main effect of alcohol consumption is shown by the F-ratio in the row labelled alcohol. The variance explained by this variable, SSM(Alcohol), is 38591.65 compared to 9242.60 units of unsystematic variation for that variable, SSR(Alcohol). These sums of squares are converted to mean squares by dividing by their respective degrees of freedom, and the F-ratio for this effect is, as ever, the mean square for the effect divided by the mean square error (12863.89/123.24 = 104.39). The probability associated with this F-ratio is reported as .000 (i.e. p < .001), which is well below the critical value of .05. We can conclude that there was a significant main effect of the amount of alcohol consumed on the attractiveness of the mate selected. We know that generally there was an effect, but without further tests (e.g. post hoc comparisons) we can’t say exactly which doses of alcohol had the most effect. I’ve plotted the means for the four doses in Figure 6.13. This graph shows that when you ignore the lighting in the club, the attractiveness of mates is similar after no alcohol and two pints of lager but starts to rapidly decline at four pints and continues to decline after six pints.
SPSS Output 6.24 shows some post hoc tests for the main effect of alcohol (see pages 173 and 178). These tests compare the mean at each dose of alcohol with the means of all other doses but control for the number of tests that have been done (so that the overall probability of a Type I error never rises above .05). In this example I’ve chosen a Bonferroni correction, which is a generally accepted procedure (see page 173). The main column of interest is the one labelled Sig., but the confidence intervals also tell us the likely difference between means if we were to take other samples. If we took 100 pairs of samples from our population and calculated the difference between their means, then these confidence intervals tell us the boundaries between which 95% of these differences would fall. Obviously, if there were a genuine difference between a pair of group means, then we’d expect none of the 95 samples to generate a difference of zero. So, if our means are genuinely different the confidence interval should not cross zero. What these results show is that the mean attractiveness of the mate selected after no alcohol was not significantly different from that of the mate selected after two pints (p is 1, which is greater than .05). However, the mean attractiveness was significantly higher after no pints than it was after four pints and six pints (both ps are less than .001). This finding is consistent with what Field (2000, Chapter 8) reported from another load of completely made-up data. We can also see that the mean attractiveness after two pints was significantly higher than after four pints and six pints (again, both ps are less than .001). Finally, the mean attractiveness after four pints was significantly higher than after six pints (p is less than .001). So, we can conclude that the beer-goggles effect doesn’t kick in until after two pints, and that it has an ever-increasing effect (well, up to six pints at any rate!).
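To see what the Bonferroni correction is doing here, it helps to count the comparisons. The sketch below lists the pairwise comparisons among the four doses and applies the usual Bonferroni adjustment of multiplying a raw p-value by the number of tests (capped at 1). The raw p-values passed to the function are invented purely for illustration, and I am assuming this adjusted-p form is what lies behind a reported value such as 'p is 1'.

```python
from itertools import combinations

doses = ["0 pints", "2 pints", "4 pints", "6 pints"]
pairs = list(combinations(doses, 2))  # every pairwise comparison of doses
n_tests = len(pairs)                  # 6 comparisons for four doses
alpha_per_test = 0.05 / n_tests       # equivalent per-test criterion

def bonferroni_adjust(p, n_tests):
    """Multiply the raw p-value by the number of tests, capping at 1."""
    return min(1.0, p * n_tests)

# Hypothetical raw p-values, not taken from SPSS Output 6.24.
adjusted_large = bonferroni_adjust(0.40, n_tests)   # capped at 1.0
adjusted_small = bonferroni_adjust(0.001, n_tests)  # roughly 0.006

print(n_tests, round(alpha_per_test, 4))
```

Comparing an adjusted p-value with .05 is the same decision as comparing the raw p-value with .05 divided by the number of tests.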
Figure 6.13 Mean attractiveness of selected mate after different quantities of alcohol when you ignore whether the selection was made in a dim or brightly lit club. Error bars show the standard error of the mean (see page 134)
SPSS Output 6.24
The interaction effect is shown by the F-ratio in the row labelled Lighting*Alcohol. This effect explains 5765.42 units of variation (SSM(interaction)) compared to 6487.33 units explained by the unsystematic variance for this effect (SSR(interaction)). The resulting F-ratio is 22.22 (1921.81/86.50), which has an associated probability value of .000 (i.e. p < .001). As such, there is a significant interaction between the amount of alcohol consumed and the lighting in the club on the attractiveness of the mate selected. The means (see Figure 6.11 and SPSS Output 6.21) help us to interpret this effect. We know that if there were no interaction, then we’d expect the same change in attractiveness across the quantities of alcohol when there was dim lighting as when there was bright lighting (see Box 6.1 and Box 6.2). The fact there is a significant interaction tells us that the change in attractiveness due to alcohol was significantly different in dim clubs compared to bright ones. Figure 6.11 shows that the decline in attractiveness of mates after two pints is much more dramatic in dim clubs than in bright ones.
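The arithmetic behind all three F-ratios is the same: divide each sum of squares by its degrees of freedom to get a mean square, then divide the effect's mean square by its own error mean square. A minimal sketch using the sums of squares quoted above follows; the degrees of freedom of 3 and 75 for alcohol and for the interaction are not stated explicitly in the text and are inferred from the quoted mean squares.

```python
def f_ratio(ss_model, df_model, ss_error, df_error):
    """Return (MS for the effect, MS for its error term, F-ratio)."""
    ms_model = ss_model / df_model
    ms_error = ss_error / df_error
    return ms_model, ms_error, ms_model / ms_error

# (SSM, dfM, SSR, dfR) for each effect; each effect has its own error term.
effects = {
    "lighting": (1993.92, 1, 2128.33, 25),
    "alcohol": (38591.65, 3, 9242.60, 75),           # df inferred, see lead-in
    "lighting x alcohol": (5765.42, 3, 6487.33, 75),  # df inferred, see lead-in
}

results = {name: f_ratio(*values) for name, values in effects.items()}
for name, (ms_m, ms_r, f) in results.items():
    print(f"{name}: MSM = {ms_m:.2f}, MSR = {ms_r:.2f}, F = {f:.2f}")
```

The values this produces agree with those in the text to within rounding.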
Can we be more precise about where the differences lie though? Well, one thing we can do is contrasts for the interaction term. These are tests that follow on from the main analysis and test specific predictions. The exact details of these tests are well beyond the scope of this book, but I cover them in a lot of detail in my statistics book if you want to know more (Field, 2000, Chapters 7 and 9). In essence they differ from post hoc tests in that rather than comparing everything with everything else, they compare a precise set of hypotheses. So, put simply, we specify certain groups that we’d like to compare. This has the advantage that we can conduct fewer tests, and therefore, we don’t have to be quite so strict to control the Type I error rate. Therefore, these tests have more power to detect effects than post hoc tests and so are usually preferable. SPSS Output 6.25 shows the output from a set of contrasts that compare each level of the alcohol variable to the previous level of that variable (this is called a repeated contrast in SPSS). So, it compares no pints with two pints (level 1 versus level 2), two pints with four pints (level 2 versus level 3) and four pints with six pints (level 3 versus level 4). As you can see from the output, if we just look at the main effect of alcohol these contrasts tell us what we already know from the post hoc tests, that is, the attractiveness after no alcohol doesn’t differ from the attractiveness after two pints (F(1, 25) < 1), the attractiveness after four pints does differ from that after two pints (F(1, 25) = 84.32, p < .001) and the attractiveness after six pints does differ from that after four pints (F(1, 25) = 27.98, p < .001). More interesting is to look at the interaction term in the table. This compares the same levels of the alcohol variable, but for each comparison it is also comparing the difference between the means for the dim and brightly-lit clubs. One way to think of this is to look at Figure 6.11 and note the vertical differences between the means for dim and bright clubs at each level of alcohol. When nothing was drunk, the distance between the bright and dim means is quite small (it’s actually 3.42 units on the attractiveness scale), and when two pints of alcohol are drunk the difference between the dim and well-lit club is still quite small (4.81 units to be precise). The first contrast is comparing the difference between dim and bright clubs when nothing was drunk with the difference between dim and bright clubs when two pints were drunk. So, it is asking ‘is 3.42 significantly different from 4.81?’ The answer is ‘no’, because the F-ratio is non-significant; in fact, it’s less than 1 (F(1, 25) < 1).
The second contrast for the interaction is comparing the difference between dim and bright clubs when two pints were drunk (4.81) with the difference between dim and bright clubs when four pints were drunk (this difference is −13.54; note that the direction of the difference has changed, as indicated by the lines crossing in Figure 6.11). This contrast is significant (F(1, 25) = 24.75, p < .001). The final contrast for the interaction is comparing the difference between dim and bright clubs when four pints were drunk (−13.54) with the difference between dim and bright clubs when six pints were drunk (this difference is −19.46). This contrast is not significant (F(1, 25) = 2.16, ns). So, we could conclude that there was a significant interaction between the amount of alcohol drunk and the lighting in the club. Specifically, the effect of alcohol after two pints on the attractiveness of the mate was much more pronounced when the lights were dim.
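The quantities that the interaction contrasts compare can be laid out as a short calculation. The sketch below takes the dim-versus-bright gaps quoted above at each dose and works out how much the gap changes from one dose to the next, which is what a repeated contrast on the interaction examines. It assumes the six-pint gap has the same sign as the four-pint gap (−19.46); that reading fits both the non-significant final contrast and the overall main effect of lighting. The code only computes the differences, it does not test them.

```python
pints = [0, 2, 4, 6]
# Gap between the dim and bright club means at each dose, as quoted in the text.
gap = [3.42, 4.81, -13.54, -19.46]  # sign of the six-pint gap is an assumption

# Repeated contrast: the gap at each dose minus the gap at the previous dose.
changes = [round(later - earlier, 2) for earlier, later in zip(gap, gap[1:])]

for (dose_before, dose_after), change in zip(zip(pints, pints[1:]), changes):
    print(f"{dose_before} vs {dose_after} pints: change in gap = {change}")
```

The change from two to four pints is by far the largest of the three, which matches the pattern of significance reported for the contrasts.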