by Andy Field
SPSS Output 7.4 shows that the data in both conditions were non-normal: the Kolmogorov-Smirnov test (see page 160) was significant both when there was a message on the record (D(32) = .177, p < .05) and when there wasn’t (D(32) = .236, p < .001). Therefore, a non-parametric test is appropriate.
SPSS Output 7.4
Output from the Wilcoxon Test
Like the Mann-Whitney test, the Wilcoxon test looks for differences in the ranked positions of scores in the two conditions. In fact, the data are ranked and then the differences between the ranks in the two conditions are examined. These differences between ranks can be positive (the rank in condition 2 is bigger than the rank in condition 1), negative (the rank in condition 2 is smaller than the rank in condition 1) or tied (the ranks in the two conditions are identical). SPSS Output 7.5 shows a summary of these ranked data; it tells us the number of negative ranks (these are people who sacrificed more goats after hearing the subliminal message than after not hearing a message) and the number of positive ranks (participants who sacrificed more goats after not hearing the message). The footnotes under the table enable us to determine to what the positive and negative ranks refer. The table shows that 11 of the 32 participants sacrificed more goats after hearing the subliminal message, whereas 17 out of 32 sacrificed more goats after not hearing the message. There were also 4 tied ranks (i.e. participants who sacrificed the same number of goats after listening to the different versions of the song). The table also shows the average number of negative and positive ranks and the sum of positive and negative ranks.
SPSS Output 7.5
If we were doing a Wilcoxon test by hand, the test statistic we use is the sum of ranks. In fact, we would calculate two test statistics: the sum of positive ranks (T+) and the sum of negative ranks (T–). SPSS Output 7.5 presents these two values in the final column of the table (T+ = 294.50 and T– = 111.50). We don’t actually use both of these test statistics; instead we use only the one with the lower value. In this case our test value would be the sum of negative ranks (T = 111.50). Normally, we’d then compare this value to tabulated values of the Wilcoxon test. However, as we saw in the previous section, the test statistic can be converted to a z-score. The advantage of this approach is that it allows exact significance values to be calculated based on the normal distribution. SPSS Output 7.6 tells us that the test statistic is based on the negative ranks, that the z-score is –2.094 and that this value is significant at p = .036. As with the Mann-Whitney test, the z-score approximation becomes more accurate as sample sizes increase, so be wary of it for small samples. Before I go on, remember that there are three types of people: (1) those who sacrificed more goats after the message (negative ranks), (2) those who sacrificed more goats after the normal version of the song (positive ranks), and (3) those who sacrificed the same number of goats regardless of whether they heard the message or not (ties). Most people fall into category two (these are the people with positive ranks), and we can tell this because 17 of the 32 participants had positive ranks compared with only 11 who had negative ranks. So, most people fell into the category of sacrificing more goats after not hearing the message. Simplistically, the test is telling us that there were significantly more people who had positive ranks (sacrificed more goats after no message was heard) than had negative ranks (sacrificed more goats after hearing the message).
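The by-hand mechanics just described can be sketched in Python using SciPy; the scores here are made up for illustration, not the chapter's data. We rank the absolute differences (dropping ties), sum the ranks of the positive and negative differences separately, and check that SciPy's `wilcoxon()` reports the smaller of the two sums as its test statistic, as described above:

```python
import numpy as np
from scipy.stats import rankdata, wilcoxon

# Made-up goat-sacrifice counts for ten listeners (not the chapter's data)
message    = np.array([4, 7, 6, 9, 3, 10, 8, 5, 6, 7])
no_message = np.array([6, 8, 6, 11, 5, 9, 12, 7, 8, 10])

# By hand: drop tied (zero) differences, rank the absolute differences,
# then sum the ranks separately for positive and negative differences
d = no_message - message
d = d[d != 0]
ranks = rankdata(np.abs(d))
T_plus = ranks[d > 0].sum()
T_minus = ranks[d < 0].sum()

# SciPy reports the smaller of the two sums as the test statistic
stat, p = wilcoxon(no_message, message)
print(T_plus, T_minus, stat)
```

With small samples and tied ranks, as here, SciPy falls back on the normal approximation for the p-value, which is exactly the caveat raised in the text.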
Therefore, we could conclude that significantly more goats were sacrificed after listening to the normal version of the song compared to after hearing the song with the message! So, although Britney does make people sacrifice their souls to the dark lord, telling people to do so seems to put people off (no-one likes being told what to do, do they?).4 This is the opposite direction to our hypothesis and as such we have to keep our 2-tailed significance value of .036. Had the conclusion been in the same direction that we’d predicted (if the message had led to more goats being killed) we could have used a 1-tailed significance value (.036/2).
SPSS Output 7.6
We can again display these data with a box-whisker diagram. Figure 7.2 shows such a plot and you can see that after the message was heard the median number of goats sacrificed was less than after no message. However, after no message was heard there were a lot more outliers (i.e. a few people went on crazy goat-killing sprees!). One of the great things about having used a non-parametric test is that we know that these outliers will not bias the results (because it was the ranks that were analysed not the actual data collected). The fact that the median is lower after listening to a message confirms the direction of our conclusions (i.e. that significantly more goats were sacrificed after listening to the normal version of the song).
Figure 7.2 Boxplot showing the number of goats sacrificed after listening to Britney Spears compared to after listening to a version of Britney Spears that had a masked message in the chorus of the song
Calculating an Effect-Size
The effect size can be calculated in the same way as for the Mann-Whitney test (see the equation on page 238). In this case SPSS Output 7.6 tells us that z is –2.094, and we again had 40 observations (although we only used 20 people and tested them twice it is the number of observations, not the number of people, that is important here). The effect size is, therefore:
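Plugging these numbers into the equation from page 238 (r = z divided by the square root of the number of observations):

$$ r = \frac{z}{\sqrt{N}} = \frac{-2.094}{\sqrt{40}} = -.33 $$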
This represents a medium effect (it is close to Cohen’s benchmark of .3), which tells us that the effect of whether or not a subliminal message was present was a substantive effect.
Writing and Interpreting the Results
For the Wilcoxon test, we need only report the test statistic (which we saw earlier is denoted by T) and its significance. So, we could report something like:
The number of goats sacrificed after hearing the message (Mdn = 9) was significantly less than after hearing the normal version of the song (Mdn = 11), T = 111.50, p < .05.
As with the Mann-Whitney test we should report either the z-score, or the effect size. The effect size is most useful:
The number of goats sacrificed after hearing the message (Mdn = 9) was significantly less than after hearing the normal version of the song (Mdn = 11), T = 111.50, p < .05, r = –.33.
7.4 The Kruskal-Wallis Test
The Kruskal-Wallis test is the non-parametric equivalent of one-way independent ANOVA (see page 174) and so is used for testing differences between groups when there are more than two conditions and different participants have been used in all conditions (each person contributes only 1 score to the data).
Example: A researcher was interested in trying to prevent coulrophobia (fear of clowns) in children. She decided to do an experiment in which different groups of children (15 in each) were exposed to different forms of positive information about clowns. The first group watched some adverts for McDonald’s in which their mascot Ronald McDonald is seen cavorting about with children going on about how they should love their mum. A second group was told a story about a clown who helped some children when they got lost in a forest (although what on earth a clown was doing in a forest remains a mystery). A third group was entertained by a real clown, who came into the classroom and made balloon animals for the children.5 A final group acted as a control condition and they had nothing done to them at all. The researcher took self-report ratings of how much the children liked clowns (rather like the fear-beliefs questionnaire in Chapter 2) resulting in a score for each child that could range from 0 (not scared of clowns at all) to 5 (very scared of clowns).
SPSS Output 7.7 shows that the Kolmogorov-Smirnov test (see page 160) was significant for the control group (D(15) = .419, p < .001), for the group exposed to a real clown (D(15) = .230, p < .05) and was nearly significant for the group who received the story about the clown (D(15) = .217, p = .056). In this latter case the Shapiro-Wilk test is in fact significant, and this test is actually more accurate (though less widely reported) than the Kolmogorov-Smirnov test (see Field, 2000, Chapter 2). The only group that produced approximately normal data was the group who saw the adverts (D(15) = .173, ns).
SPSS Output 7.7
Output from the Kruskal-Wallis Test
Like the other non-parametric tests we’ve come across, the Kruskal-Wallis test analyses the ranked data and so SPSS Output 7.8 shows a summary of these ranked data; it tells us the mean rank in each condition. These mean ranks are important later for interpreting any effects.
SPSS Output 7.8
If we were doing a Kruskal-Wallis test by hand we would rank the data ignoring the group to which each score belongs. We then work out the sum of ranks for each group. The test statistic (H) is a function of these total ranks and the sample sizes on which they are based. The test statistic has a special kind of distribution called a chi-square distribution, which for the Kruskal-Wallis test has k – 1 degrees of freedom, where k is the number of groups (see Box 8.1). SPSS Output 7.9 shows this test statistic (SPSS labels it chi-square rather than H), its associated degrees of freedom (in this case we had 4 groups so the degrees of freedom are 4 – 1, or 3), and its significance. Therefore, we could conclude that the type of information presented to the children about clowns significantly affected their fear ratings of clowns. Like a one-way ANOVA, though, this test tells us only that a difference exists; it doesn’t tell us exactly where the difference lies.
SPSS Output 7.9
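The by-hand calculation just described can be sketched in Python with SciPy. The scores below are made up and continuous (so there are no tied ranks to correct for; remember the chapter's real data are 0–5 integer ratings):

```python
import numpy as np
from scipy.stats import chi2, kruskal, rankdata

rng = np.random.default_rng(42)
# Made-up, continuous "fear" scores for four groups of 15
groups = [rng.normal(loc, 1.0, size=15) for loc in (4.0, 1.0, 1.0, 3.0)]

# By hand: rank all scores together (ignoring group), sum the ranks
# within each group, then H = 12/(N(N+1)) * sum(R_k^2 / n_k) - 3(N+1)
ranks = rankdata(np.concatenate(groups))
N = ranks.size
H, start = 0.0, 0
for g in groups:
    R = ranks[start:start + len(g)].sum()
    H += R**2 / len(g)
    start += len(g)
H = 12.0 / (N * (N + 1)) * H - 3 * (N + 1)

# H has a chi-square distribution with k - 1 degrees of freedom
p_by_hand = chi2.sf(H, df=len(groups) - 1)

stat, p = kruskal(*groups)  # agrees, because there are no ties to correct
```

With tied scores (as in the real clown data) SciPy additionally applies a tie correction to H, so the by-hand value would differ slightly.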
One way to see which groups differ is to look at a boxplot of the groups (see Figure 7.3). The first thing to note is that this boxplot is a bit odd because the group that received no information don’t have any whiskers on their box, and those that received stories have only one whisker! Just to add to the confusion, the lines representing the medians appear to be missing. Well, this is what happens when you have data that are measured on a very limited scale (a child can only score 0, 1, 2, 3, 4 or 5). The medians are actually shown, but they clash with the lower quartile (the lower line of the box) for all conditions except the advert one (note the bottom line is thicker – that’s the median line). The whiskers are missing in the control condition because the highest and lowest scores clash with the value of the upper and lower quartile (i.e. the values of the two ends of the box are the same as the value for the whiskers so the whiskers can’t be shown). I’ve just used this example to illustrate that these plots can look different sometimes. In any case, using the control as our baseline, the median is 3 and the story and exposure conditions have medians of 1, so they appear to reduce fear beliefs. The advert condition, however, has a median of 4 so the adverts appear to have increased fear beliefs. However, these conclusions are subjective. What we really need are some contrasts or post hoc tests like we used in ANOVA (see page 173).
Figure 7.3 Boxplot for the fear beliefs about clowns after exposure to different formats of information (adverts, stories, a real clown or nothing)
There aren’t any commonly used non-parametric post hoc procedures, but we can still easily test some hypotheses by using Mann-Whitney tests. If we want to use lots of Mann-Whitney tests we have to be careful because each time we do one there is a 5% chance that we’ll conclude that there is an effect when there isn’t (a Type I error). Remember that in Chapter 6 I told you that the reason we do tests like ANOVA is that they control the build-up of Type I errors. Well, if we want to use lots of Mann-Whitney tests to follow up a Kruskal-Wallis test, then we have to make some kind of adjustment to ensure that the Type I errors don’t build up to more than .05. The easiest method is to use a Bonferroni correction, which in its simplest form just means that instead of using .05 as the critical value for significance for each test, you use a critical value of .05 divided by the number of tests you’ve conducted. If you do this, you’ll soon discover that you quickly end up with a critical value for significance so small that it is very restrictive. Therefore, it’s a good idea to be selective about the comparisons you make. In this example, we have a control group who had no clown information given to them. As such, a nice succinct set of comparisons would be to compare each group against the control:
Test 1: Advert compared to control
Test 2: Story compared to control
Test 3: Exposure compared to control
This results in three tests, so rather than use .05 as our critical level of significance, we’d use .05/3 = .0167. If we didn’t use focused tests and just compared all groups with all other groups we’d end up with six tests rather than three (advert vs. story, advert vs. exposure, advert vs. control, story vs. exposure, story vs. control, exposure vs. control) meaning that our critical value would fall to .05/6 = .0083.
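The arithmetic of the correction is simple enough to check for yourself:

```python
from itertools import combinations

alpha = 0.05
groups = ["advert", "story", "exposure", "control"]

# Focused comparisons: each experimental group against the control
focused = [(g, "control") for g in groups if g != "control"]
critical_focused = alpha / len(focused)      # .05 / 3 = .0167

# Comparing every group with every other group instead
all_pairs = list(combinations(groups, 2))
critical_all = alpha / len(all_pairs)        # .05 / 6 = .0083
print(len(focused), round(critical_focused, 4),
      len(all_pairs), round(critical_all, 4))
```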
SPSS Output 7.10 shows the test statistics from doing Mann-Whitney tests on the three focused comparisons that I suggested. Remember that we are now using a critical value of .0167, so the only comparison that is significant is that between the advert and the control group (because the observed significance value of .001 is less than .0167). The other two comparisons produce significance values greater than .0167, so we’d have to say they’re non-significant. So the effect we got seems mainly to reflect the fact that McDonald’s adverts significantly increased fear beliefs about clowns relative to controls.
SPSS Output 7.10
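One such follow-up comparison can be sketched in Python with SciPy; the ratings below are invented for illustration, not the data behind SPSS Output 7.10:

```python
from scipy.stats import mannwhitneyu

# Made-up fear-belief ratings (0-5) for two groups of 15 children
advert  = [4, 5, 3, 4, 5, 4, 4, 3, 5, 4, 4, 5, 3, 4, 4]
control = [3, 2, 3, 4, 3, 2, 3, 3, 2, 3, 4, 2, 3, 3, 2]

U, p = mannwhitneyu(advert, control, alternative="two-sided")

# Compare against the Bonferroni-corrected criterion, not .05
corrected_alpha = 0.05 / 3
print(f"U = {U}, p = {p:.4f}, significant: {p < corrected_alpha}")
```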
Calculating an Effect-Size
Unfortunately there isn’t an easy way to convert a chi-square statistic that has more than 1 degree of freedom to an effect size r. You could use the significance value of the Kruskal-Wallis test statistic to find an associated value of z from a table of probability values for the normal distribution (like that provided in Field, 2000, p. 471). From this you could use the conversion to r that we used on page 238. However, to keep things simple, we could just calculate effect sizes for the Mann-Whitney tests that we used to follow up the main analysis. These effect sizes will be very informative in their own right.
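If you did want to go the z route, the steps are easy to sketch in Python (SciPy replaces the probability table). Note that N = 60 (four groups of 15 observations) is my assumption for the conversion; the text doesn't state which N to use here:

```python
from math import sqrt
from scipy.stats import chi2, norm

# The chapter's Kruskal-Wallis result: H(3) = 17.06
p = chi2.sf(17.06, df=3)       # significance of H

# Find the z associated with that probability, then convert to r
z = norm.isf(p)                # inverse of the normal survival function
r = z / sqrt(60)               # assumes N = 60 total observations
print(f"p = {p:.5f}, z = {z:.2f}, r = {r:.2f}")
```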
For the first comparison (adverts vs. control) SPSS Output 7.10 shows us that z is –3.261, and because this was based on comparing two groups each containing 15 observations, we had 30 observations in total. The effect size is, therefore:
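Plugging in:

$$ r = \frac{z}{\sqrt{N}} = \frac{-3.261}{\sqrt{30}} = -.60 $$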
This represents a large effect (it is above Cohen’s benchmark of .5), which tells us that the effect of adverts relative to the control was a substantive effect.
For the second comparison (story vs. control) SPSS Output 7.10 shows us that z is –2.091, and this was again based on 30 observations. The effect size is, therefore:
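Plugging in:

$$ r = \frac{z}{\sqrt{N}} = \frac{-2.091}{\sqrt{30}} = -.38 $$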
This represents a medium to large effect (because it is between Cohen’s benchmarks of .3 and .5). Therefore, although non-significant the effect of stories relative to the control was a substantive effect.
For the final comparison (exposure vs. control) SPSS Output 7.10 shows us that z is –1.743, and this was again based on 30 observations. The effect size is, therefore:
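Plugging in:

$$ r = \frac{z}{\sqrt{N}} = \frac{-1.743}{\sqrt{30}} = -.32 $$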
This represents a medium effect. Therefore, although non-significant the effect of exposure relative to the control was a substantive effect.
Writing and Interpreting the Results
For the Kruskal-Wallis test, we need only report the test statistic (which we saw earlier is denoted by H), its degrees of freedom and its significance. So, we could report something like:
Children’s fear beliefs about clowns were significantly affected by the format of information given to them (H(3) = 17.06, p < .01).
However, we need to report the follow-up tests as well (including their effect sizes):
Children’s fear beliefs about clowns were significantly affected by the format of information given to them (H(3) = 17.06, p < .01). Mann-Whitney tests were used to follow up this finding. A Bonferroni correction was applied and so all effects are reported at a .0167 level of significance. It appeared that fear beliefs were significantly higher after the adverts compared to the control, U = 37.50, r = –.60. However, fear beliefs were not significantly different after the stories, U = 65.00, ns, r = –.38, or exposure, U = 72.5, ns, r = –.32, relative to the control. We can conclude that clown information through stories and exposure did produce medium-sized effects in reducing fear beliefs about clowns, but not significantly so (future work with larger samples might be appropriate), but that Ronald McDonald was sufficient to significantly increase fear beliefs about clowns.6
7.5 Friedman’s ANOVA
Friedman’s ANOVA is the non-parametric equivalent of one-way repeated measures ANOVA (see page 183) and so is used for testing differences between experimental conditions when there are more than two conditions and the same participants have been used in all conditions (each person contributes several scores to the data).
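As a preview, SciPy implements this test as `friedmanchisquare()`, which takes one sequence of scores per condition (the same participants in each). A minimal sketch with invented scores, not data from the book:

```python
from scipy.stats import friedmanchisquare

# Made-up scores: 8 participants, each measured in three conditions
cond1 = [4, 6, 3, 5, 7, 4, 6, 5]
cond2 = [2, 5, 2, 3, 6, 3, 4, 4]
cond3 = [3, 4, 2, 4, 5, 3, 5, 3]

# Friedman's chi-square statistic has k - 1 = 2 degrees of freedom here
stat, p = friedmanchisquare(cond1, cond2, cond3)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```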