by Andy Field
Figure 6.15 Scatterplot showing the relationship between the time spent stalking before therapy and the time spent stalking after therapy
Calculating the Effect Size for ANCOVA
In ANCOVA the effect size is calculated in the same way as all of the other ANOVAs we’ve come across; we just need to find out the effect of the variables in the model (the independent variables and the covariate) and the error term. If we look at SPSS Output 6.29 we can get the value of the mean square for each effect (MSM), and the mean square for its associated error term (MSR). We can then place these values into the equation for the effect size on page 181.
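For the effect of therapy, the mean squares for the experimental effect are 480.27, and the mean squares for the error term are 87.48. With n = 25 participants in each therapy group, placing these values into the equation gives:
r = √[(MSM – MSR) / (MSM + (n – 1) × MSR)] = √[(480.27 – 87.48) / (480.27 + 24 × 87.48)] = √(392.79 / 2579.79) = .39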
Using the benchmarks for effect sizes this represents a medium to large effect (it is between the thresholds of .3 and .5). Therefore, the effect of a cattle prod compared to psychodyshamic therapy is a substantive finding.
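For the effect of the covariate, the error mean square is the same, but the effect is much bigger (MSM is 4414.60 rounded to 2 decimal places). If we place this value in the equation (with 25 participants per group), we get the following:
r = √[(4414.60 – 87.48) / (4414.60 + (25 – 1) × 87.48)] = √(4327.12 / 6514.12) = .82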
Using the benchmarks for effect sizes this represents a very large effect (it is well above the threshold of .5, and is close to 1). Therefore, the relationship between initial stalking behaviour and the stalking behaviour after therapy is very strong indeed.
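If you would rather let a computer do the arithmetic, the two effect sizes can be checked with a few lines of Python. This is only a sketch: it assumes that the equation on page 181 is the square root of omega squared, (MSM – MSR) / (MSM + (n – 1) × MSR), and that there were 25 participants in each therapy group (inferred from the degrees of freedom for Levene's test). Neither assumption is stated in this section, although together they reproduce the values of r reported in the write-up. The function name is my own invention.

```python
import math


def effect_size_r(ms_model: float, ms_error: float, n_per_group: int) -> float:
    """Square root of omega squared, computed from two mean squares.

    Assumed form: omega^2 = (MSM - MSR) / (MSM + (n - 1) * MSR).
    """
    if ms_model < ms_error:
        # The estimate would be negative, so its square root is undefined.
        raise ValueError("effect mean square is smaller than error mean square")
    omega_sq = (ms_model - ms_error) / (ms_model + (n_per_group - 1) * ms_error)
    return math.sqrt(omega_sq)


# Mean squares quoted in the text from SPSS Output 6.29.
MS_ERROR = 87.48
N_PER_GROUP = 25  # assumption: 50 participants split evenly over two therapies

r_therapy = effect_size_r(480.27, MS_ERROR, N_PER_GROUP)
r_covariate = effect_size_r(4414.60, MS_ERROR, N_PER_GROUP)

print(f"therapy:   r = {r_therapy:.2f}")
print(f"covariate: r = {r_covariate:.2f}")
```

The same function can be reused for any effect in the chapter, provided you feed it the mean square for that effect, the mean square for its error term and the number of people per group.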
Writing the Result for ANCOVA
The write-up for an ANCOVA is exactly the same as for the ANOVA, except that we have to report and interpret the effect of the covariate. For this analysis, we need to report two effects: the effect of therapy and the effect of the covariate. We can report these effects as follows (check back to SPSS Output 6.29 to see from where I got the degrees of freedom):
Levene’s test was significant, F(1, 48) = 7.19, p < .05, indicating that the assumption of homogeneity of variance had been broken. The main effect of therapy was significant, F(1, 47) = 5.49, p < .05, r = .39, indicating that the time spent stalking was lower after using a cattle prod (M = 55.30, SE = 1.87) compared to after psychodyshamic therapy (M = 61.50, SE = 1.87).
The covariate was also significant, F(1, 47) = 50.46, p < .001, r = .82, indicating that level of stalking before therapy had a significant effect on level of stalking after therapy (there was a positive relationship between these two variables).
6.12 Summary
* * *
This chapter has introduced you to most of the commonly used parametric statistical tests. We started (in fact it’s so long ago since I started writing the damn thing that I can’t even remember how we started . . .) by looking at simple situations in which you have only two experimental conditions (the t-test). The remainder of the chapter introduced you to Analysis of Variance (ANOVA). For each test we looked at an example, saw how we could interpret the output from SPSS (a commonly used statistical package) and started to think about how we could interpret and report the results. Along the way we discovered that we couldn’t always rely on our data meeting the assumptions of certain tests. The next chapter will tell us what to do when we break these assumptions.
6.13 Practical Tasks
* * *
For each example in this chapter:
Write a sample results section (using the summaries at the end of each example as a guide) but as you do so identify where all of the numbers come from and what they mean.
For each example in this chapter use Field (2000) or notes on my web page (http://www.cogs.susx.ac.uk/users/andyf/teaching):
Analyse the data and see if you can get the same outputs as I have (the SPSS data files can be found on my website).
6.14 Further Reading
* * *
It will come as no great shock to discover that I recommend my own book for more details on the statistical tests covered in this chapter. The whole reason I wrote the other book was because I wanted an integrated guide on statistical theory and SPSS, and I worked my socks off trying to make it the most entertaining guide possible. It’s got lots of theory in it, and lots of practical advice to help you get the results that I did using SPSS.
Field, A. P. (2000). Discovering statistics using SPSS for Windows: advanced techniques for the beginner. London: Sage. Chapters 6 (t-test), 7 (one-way ANOVA), 8 (ANCOVA and two-way independent ANOVA) and 9 (repeated measures ANOVAs).
If you don’t like my book, then try:
Howell, D. C. (2001). Statistical methods for psychology (5th edition). Belmont, CA: Duxbury. This is the definitive textbook for theory, but some undergraduates do find it too complex.
Notes
1 As we saw in the section on p. 132, the standard error is just the sample standard deviation divided by the square root of the sample size (SE = s / √N), so for the ‘Marie Claire’ condition SE = 4.71/ √10 = 4.71/3.16 = 1.489.
2 I used the two-tailed probability because I made no specific prediction about the direction of the effect. However, had I predicted before the experiment that happiness would be higher after reading ‘Marie Claire’ then this would be a one-tailed hypothesis. Often in research we can make specific predictions about which group has the highest mean. In this case, we can use a one-tailed test (for more discussion of this issue see Box 5.3 and Field, 2000). It might seem strange that SPSS produces only the two-tailed significance and that there isn’t an option that can be selected to produce the one-tailed significance. However, there is no need for an option because, when the results are in the direction that you predicted, the one-tailed probability can be ascertained by dividing the two-tailed significance value by 2. In this case, the two-tailed probability was .048, therefore the one-tailed probability is .024 (= .048/2). However, if the results are in the opposite direction to what you predicted you have to keep your two-tailed significance value.
3 In fact, if you used ANOVA when you have only two experimental conditions you’d reach identical conclusions to when a two-tailed t-test is used.
4 SPSS rounds values to three decimal places. Therefore, when you see a significance value of .000 in SPSS this isn’t zero, it’s just zero to 3 decimal places; the actual value could be .0001 or .000004768. As far as we’re concerned we just report these significance values as p < .001.
5 For the benefit of Garth Brooks’ lawyers, if they’re reading, I should add a disclaimer that this is only the opinion of the authors and that we’re sure all of his CDs are wonderful really.
6 In case you haven’t had the pleasure, their record label has a website: http://www.dischord.com
7 Although this punished them for any attempts to use a mobile phone, an unfortunate side effect was that 10 of the sample developed conditioned phobias of porridge after repeatedly trying to heat some up in the microwave!
8 It’s interesting that the control group means dropped too. This could be because the control group were undisciplined and still used their mobile phones, or it could just be that the education system in this country is so underfunded that there is no-one to teach English anymore!
7 Non-parametric Statistics
* * *
In the last chapter, we saw that to do parametric statistics we needed certain assumptions to be met. We also saw that these assumptions are not always met and that often we can’t even transform the data to make them conform to the assumptions we require. So, what on earth do we do in these situations? Luckily, a whole range of tests has been developed that do not require parametric assumptions; these are called non-parametric tests. This chapter concentrates on the four most commonly used non-parametric tests and, like the last chapter, I won’t dwell on the theory behind them. I’ll just tell you when they should be used, show you an output from SPSS and explain how to interpret and write up the results. I hope that I can do this in fewer pages than the parametric tests!
7.1 Non-Parametric Tests: Rationale
* * *
Non-parametric tests are sometimes known as assumption-free tests because they make less strict assumptions about the distribution of data being analysed.1 The way they get around the problem of the distribution of the data is by not using the raw scores. Instead, the data are ranked. The basic idea of ranking is that you find the lowest score and give it a rank of 1, then find the next highest score and give it a rank of 2, and so on. As such, high scores are converted into large ranks, and low scores are converted into small ranks. The analysis is carried out on the ranks rather than the actual data. Ranking is an ingenious way around the problem of using non-normal data but it has a price: by ranking the data we lose some information about the magnitude of difference between scores and because of this non-parametric tests are less powerful than the parametric counterparts. I talked about the notion of statistical power earlier on (see page 154): it refers to the ability of a test to find an effect that genuinely exists. The fact that non-parametric tests are generally (but not always) less powerful means that if there is a genuine effect in our data, then, if its assumptions are met, a parametric test is more likely to detect it than a non-parametric one. Put a different way, there is an increased chance of a Type II error (i.e. more chance of accepting that there is no difference between groups when, in reality, a difference exists).
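To make the idea of ranking concrete, here is a minimal sketch in Python. The scores are invented purely for illustration, and the function name is hypothetical. The text above doesn't say what happens when two scores are identical, so the sketch assumes the usual convention of giving tied scores the average of the ranks they would otherwise have occupied.

```python
def rank_scores(scores):
    """Return the rank of each score (1 = lowest), averaging ranks for ties."""
    ordered = sorted(scores)
    ranks = []
    for score in scores:
        first = ordered.index(score) + 1   # position of the first occurrence
        count = ordered.count(score)       # how many times the score occurs
        # Tied scores share the average of the positions they occupy.
        ranks.append(first + (count - 1) / 2)
    return ranks


scores = [12, 7, 30, 7, 18]  # invented scores, not data from the book
print(rank_scores(scores))
```

Notice that the score of 30 simply becomes the largest rank: the ranks record that it is the highest score, but not how far ahead of the others it is. That is exactly the information we lose by ranking.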
7.2 The Mann-Whitney Test
* * *
The Mann-Whitney test is the non-parametric equivalent of the independent t-test (see page 163) and so is used for testing differences between groups when there are two conditions and different participants have been used in each condition.
Example: A psychologist was interested in the cross-species differences between men and dogs. She observed a group of dogs and a group of men in a naturalistic setting (20 of each). She classified several behaviours as being dog-like (urinating against trees and lampposts, attempts to copulate with anything that moved, and attempts to lick their own genitals). For each man and dog she counted the number of dog-like behaviours displayed in a 24-hour period. It was hypothesized that dogs would display more dog-like behaviours than men.2
This psychologist, having collected the data, noticed that the data were non-normal in the dog condition. In fact the Kolmogorov-Smirnov test (see page 160) was highly significant (D(20) = .244, p < .01) for the dogs but wasn’t for the men (D(20) = .175, ns) (see SPSS Output 7.1). The fact that the dog data are non-normal tells us that a non-parametric test is appropriate.
SPSS Output 7.1
Output from the Mann-Whitney Test
The Mann-Whitney test looks for differences in the ranked positions of scores in the two groups. It makes sense then that SPSS first summarizes the data after they have been ranked. It tells us the average and total ranks in each condition (see SPSS Output 7.2). I told you earlier that scores are ranked from lowest to highest: therefore, the group with the lowest mean rank will have more low scores in it than the group with the highest mean rank. Therefore, this initial table tells us which group had the highest scores, which enables us to interpret a significant result should we find one.
The second table (SPSS Output 7.3) provides the actual test statistics for the Mann-Whitney test. Actually, although Mann and Whitney get all of the credit for this test, Wilcoxon also came up with a statistically comparable technique for analysing ranked data. The form of the test commonly taught is that of the Mann-Whitney test (I’m sure this has happened only because it would be confusing to many students and researchers to have two Wilcoxon tests that were used in different situations!). However, Wilcoxon’s version of the test can be converted into a z-score and, therefore, can be compared against critical values of the normal distribution. This is handy because it means we can find out the exact significance rather than relying on printed tables of critical values. SPSS provides both statistics and the z-score for the Wilcoxon statistic.
SPSS Output 7.3 provides the value of Mann-Whitney’s U statistic, the value of Wilcoxon’s statistic and the associated z approximation. The z approximation becomes more accurate as sample sizes increase, so the bigger your sample, the more you can be confident in this statistic – and for very small samples you probably shouldn’t use it at all. The important part of the table is the significance value of the test, which gives the two-tailed probability of obtaining a test statistic at least this extreme if there were really no difference between the groups (see Box 5.3). The two-tailed probability (.88) is non-significant because it is greater than .05 (see Chapter 5). However, the psychologist made a specific prediction that dogs would be more dog-like than men, and the mean ranks are in the predicted direction, so we can actually halve the probability value to give us the one-tailed probability (.88/2 = .44), but even this is non-significant. This finding indicates that men and dogs do not differ significantly in the amount that they display dog-like behaviour. If we look at the ranks for each group (SPSS Output 7.2) we see that the mean rank for the dogs was 20.77, and for the men was 20.23. Therefore the ranks were pretty equivalent.
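For the curious, the numbers in this output can be roughly reconstructed from the mean ranks alone. The sketch below is not what SPSS does internally: it assumes the standard textbook formulas for U and for the normal approximation, it ignores any correction for tied ranks, and the function name is my own. Because the mean ranks quoted above are rounded, the value of U comes out at about 194.6 rather than exactly 194.5.

```python
import math


def mann_whitney_from_rank_sums(rank_sum_1, n1, rank_sum_2, n2):
    """Return (U, z, two-tailed p) from the sum of ranks in each group.

    Uses the normal approximation with no correction for tied ranks.
    """
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - rank_sum_1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - rank_sum_2
    u = min(u1, u2)                      # the smaller U is the one reported
    mean_u = n1 * n2 / 2                 # expected U if the groups don't differ
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p_two_tailed


# Rank sums rebuilt from the rounded mean ranks in the text (mean rank x 20).
u, z, p = mann_whitney_from_rank_sums(20.77 * 20, 20, 20.23 * 20, 20)

print(f"U = {u:.1f}, z = {z:.2f}")
print(f"two-tailed p = {p:.2f}, one-tailed p = {p / 2:.2f}")
```

The halving in the last line is legitimate here only because the dogs' mean rank is the higher of the two, which is the direction the psychologist predicted.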
SPSS Output 7.2
SPSS Output 7.3
A good way to display non-parametric data3 is by using a box-whisker diagram (or boxplot for short). Figure 7.1 shows such a plot, and it should be clear that the plot gets its name because it is a shaded box with two whiskers coming out of it! The shaded box represents the range between which the middle 50% of the data fall (this is called the interquartile range). The horizontal bar within the shaded box is the median (see page 117). The ‘I’ shape shows us the limits within which most or all of the data fall. Generally, the lower bar is the lowest score and the upper bar is the highest score in the data; however, if there is an outlier (a score very different from the rest) then it will fall outside of the bars, and the bars represent the lowest and highest scores that are not outliers (in SPSS, scores lying within 1.5 times the interquartile range of the edges of the box). In this example there is an outlier and you can see it is represented by a circle above the top of the bar for the men. Why is this graph better than plotting the means? Well, non-parametric tests are not testing differences between means; they are testing differences between ranks. Means are biased by things like outliers (see Box 4.3), whereas ranks are not. Therefore, the median is likely to better represent what the non-parametric test is actually testing (because it too is not influenced by outliers). A box-whisker plot shows the median and so better represents what the non-parametric test is looking at.
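If you want to see where the parts of a boxplot come from, the following Python sketch works them out for a single group. The scores are invented (they are not the data behind Figure 7.1), the function name is hypothetical, and the sketch assumes the common convention that an outlier is any score more than 1.5 times the interquartile range beyond the edge of the box. Different packages calculate quartiles in slightly different ways, so treat the exact box edges as approximate.

```python
import statistics


def boxplot_summary(scores):
    """Return the values a box-whisker plot displays for one group of scores."""
    ordered = sorted(scores)
    # Quartiles: cut points splitting the ordered scores into four equal parts.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the length of the box
    # Assumed outlier rule: more than 1.5 box lengths beyond the box edges.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [s for s in ordered if low_fence <= s <= high_fence]
    return {
        "median": median,
        "box": (q1, q3),
        "whiskers": (inside[0], inside[-1]),  # extreme scores that aren't outliers
        "outliers": [s for s in ordered if s < low_fence or s > high_fence],
    }


hypothetical_scores = [21, 23, 24, 25, 27, 28, 29, 31, 60]  # invented data
summary = boxplot_summary(hypothetical_scores)
print(summary)
```

The score of 60 is flagged as an outlier, so the upper whisker stops at the highest remaining score, and the median is unaffected by how extreme that outlier is.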
Figure 7.1 Boxplot for the dog-like behaviour in dogs and men
Calculating an Effect-Size
Effect sizes are really easy to calculate thanks to the fact that SPSS converts the test statistics into a z-score. The equation to convert a z-score into the effect size estimate, r, is as follows (from Rosenthal, 1991, p. 19):
r = z / √N
in which z is the z-score that SPSS produces, and N is the size of the study (i.e. the number of observations) on which z is based. In this case SPSS Output 7.3 tells us that z is –.15, and we had 20 men and 20 dogs so the total number of observations was 40. The effect size is, therefore:
r = –.15 / √40 = –.15 / 6.32 = –.02
This represents a tiny effect (it is close to zero), which tells us that there truly isn’t much difference between dogs and men.
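The conversion from z to r takes one line of arithmetic, so it is easy to script if you have several tests to report. A minimal sketch in Python (the function name is just for illustration):

```python
import math


def r_from_z(z: float, n_observations: int) -> float:
    """Convert a z-score into the effect size r by dividing by sqrt(N)."""
    return z / math.sqrt(n_observations)


# z from SPSS Output 7.3; N = 20 men + 20 dogs.
r = r_from_z(-0.15, 40)
print(f"r = {r:.2f}")
```

The sign of r simply follows the sign of z; it is the size of the value (here practically zero) that tells us how big the effect is.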
Writing and Interpreting the Results
For the Mann-Whitney test we need only report the test statistic (which is denoted by U) and its significance. So, we could report something like:
Men (Mdn = 27) did not seem to differ from dogs (Mdn = 24) in the amount of dog-like behaviour they displayed (U = 194.5, ns).
Note that I’ve reported the median for each condition. Of course, we really ought to include the effect size as well. We could do two things. The first is to report the z-score associated with the test statistic. This value would enable the reader both to determine the exact significance of the test and to calculate the effect size r:
Men (Mdn = 27) and dogs (Mdn = 24) did not significantly differ in the extent to which they displayed dog-like behaviours, U = 194.5, ns, z = –.15.
The alternative is to just report the effect size (because readers can convert back to the z-score if they need to for any reason). This approach is better because the effect size will probably be most useful to the reader.
Men (Mdn = 27) and dogs (Mdn = 24) did not significantly differ in the extent to which they displayed dog-like behaviours, U = 194.5, ns, r = –.02.
7.3 The Wilcoxon Signed-Rank Test
* * *
The Wilcoxon signed-rank test is the non-parametric equivalent of the dependent t-test (see page 168) and so is used for testing differences between groups when there are two conditions and the same participants have been used in both conditions.
Example: there’s been much speculation over the years about the influence of subliminal messages on records. To name a few cases, both Ozzy Osbourne and Judas Priest have been accused of putting backward masked messages on their albums that subliminally influence poor unsuspecting teenagers into doing things like blowing their heads off with shotguns. A psychologist was interested in whether backward masked messages really did have an effect. He took the master tapes of Britney Spears’ ‘Baby one more time’ and created a second version that had the masked message ‘deliver your soul to the dark lord’ repeated in the chorus. He took this version and the original, and played one version (randomly) to a group of 32 people. He took the same group 6 months later and played them whatever version they hadn’t heard the time before. So each person heard both the original and the version with the masked message, but at different points in time. The psychologist measured the number of goats that were sacrificed in the week after listening to each version. It was hypothesized that the backward message would lead to more goats being sacrificed.