How to Design and Report Experiments
Example: A psychologist was interested in the effects of television programmes on domestic life. She hypothesized that through vicarious learning7 certain programmes might actually encourage people to behave like the characters within them. This in turn could affect the viewer’s own relationships (depending on whether the programme depicted harmonious or dysfunctional relationships). She took episodes of three popular TV shows, and showed them to 54 couples after which the couple were left alone in the room for an hour. The experimenter measured the number of times the couple argued. Each couple viewed all three of the TV programmes at different points in time (a week apart) and the order in which the programmes were viewed was counterbalanced over couples. The TV programmes selected were Eastenders (which typically portrays the lives of extremely miserable, argumentative, London folk who like nothing more than to beat each other up, lie to each other, sleep with each other’s wives and generally show no evidence of any consideration to their fellow humans!), Friends (which portrays a group of unrealistically considerate and nice people who love each other oh so very much – but for some reason I love it anyway!), and a National Geographic programme about whales (this was supposed to act as a control).
SPSS Output 7.11 shows the Kolmogorov-Smirnov test (see page 160) for each condition. This test was significant for the data generated after Eastenders (D(54) = .137, p < .05), Friends (D(54) = .150, p < .01) and the National Geographic programme about whales (D(54) = .121, p < .05). Given that all the data are non-normal, a non-parametric test is appropriate.
SPSS Output 7.11
Output from Friedman’s ANOVA
Friedman’s ANOVA works much as if you had conducted a one-way repeated measures ANOVA on the ranked data. So, like all the non-parametric tests, it is based on the ranks, not the actual scores. SPSS Output 7.12 shows the mean rank in each condition. These mean ranks are important later for interpreting any effects; they show that the ranks were highest after watching Eastenders.
If we were doing Friedman’s ANOVA by hand we would take the scores for a given participant and rank them across experimental conditions – from lowest to highest. For example, if a couple had 15 arguments after watching Eastenders, 6 after watching Friends and 7 after watching the whales, then we would give Friends a rank of 1 (because it produced the lowest score), the whales a rank of 2, and Eastenders a rank of 3 (because it generated the highest score). If there were no effect of the programmes, then we’d expect these ranks to be fairly randomly distributed across the conditions, and the mean rank for each condition would be fairly similar to each other. If one programme does produce more arguments, then we’d expect the scores to generally be highest in that condition, so its ranks will generally be higher and the mean rank will be higher than the other conditions.
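The within-participant ranking described above can be sketched in Python (a hypothetical illustration only; the chapter’s analysis was run in SPSS). The example uses the couple from the text: 15 arguments after Eastenders, 6 after Friends and 7 after the whale programme.

```python
# Rank one participant's scores across conditions, lowest = 1.
from scipy.stats import rankdata

def rank_within_participant(scores):
    """Rank a single couple's scores across the experimental conditions."""
    return list(rankdata(scores))

# Eastenders = 15, Friends = 6, whales = 7 -> ranks 3, 1, 2
ranks = rank_within_participant([15, 6, 7])
```

Averaging these ranks over all 54 couples gives the mean ranks reported in SPSS Output 7.12.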
SPSS Output 7.12
The test statistic (which is denoted by the usual symbol for a chi-square statistic, χ2) is a function of the total of the ranks in each condition, the sample size on which they are based, and the number of conditions there were (k). Like the Kruskal-Wallis test, the test statistic has a chi-square distribution with k – 1 degrees of freedom (k is the number of conditions). SPSS Output 7.13 shows this chi-square test statistic and its associated degrees of freedom (in this case we had 3 conditions so the degrees of freedom are 3 – 1, or 2), and its significance value. Therefore, we could conclude that the type of programme watched significantly affected the subsequent number of arguments (because the significance value is less than .05). However, like a one-way ANOVA, this result doesn’t tell us exactly where the differences lie.
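The overall test can be sketched with SciPy’s implementation of Friedman’s ANOVA. Note that the arrays below are fabricated argument counts for illustration, not the chapter’s data (which were analysed in SPSS).

```python
# A minimal sketch of Friedman's ANOVA in Python; the scores are invented.
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(42)
eastenders = rng.poisson(9, size=54)  # fabricated argument counts per couple
friends = rng.poisson(6, size=54)
whales = rng.poisson(7, size=54)

# The statistic follows a chi-square distribution with k - 1 = 2 df (k = 3).
chi2, p = friedmanchisquare(eastenders, friends, whales)
```

As in the text, a p-value below .05 would indicate that the programme watched affected the number of arguments, without saying where the differences lie.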
SPSS Output 7.13
To get some idea of where the differences lie we could examine a boxplot of the conditions like the one in Figure 7.4. We can see from this graph that after watching Eastenders the median number of arguments was 8 and scores ranged from 5 to 15 (although that highest score was an outlier and most scores actually fell between 5 and 13). After watching Friends the median number of arguments was only 6 and scores ranged from 0 to 11. After watching the whale programme the median number of arguments was 7 and scores ranged from 0 to 13. So, if we were to draw some subjective conclusions they might be that Eastenders led to more arguments than watching Friends or whales, and that watching Friends led to fewer arguments than watching whales. As with the Kruskal-Wallis test, though, we really need some contrasts or post hoc tests like those we used in ANOVA (see page 173); however, we have the same problem: such contrasts can’t readily be done. The solution is much the same: we do lots of tests that compare only two conditions, but adjust the critical value of significance for the number of tests we do (see page 173). In this case, because our conditions used the same participants we’d use Wilcoxon tests to follow up the analysis (see page 239).
When thinking about how to follow up the analysis it’s a good idea to be selective about the comparisons you make. In this example, we have a control condition in which people watched a programme about whales. A nice succinct set of comparisons would be to compare each experimental condition against the control:
Figure 7.4 Boxplot for the number of arguments had after watching Eastenders, Friends or a National Geographic programme about whales
Test 1: Eastenders compared to control
Test 2: Friends compared to control
This gives rise to only two tests, so rather than use .05 as our critical level of significance, we’d use .05/2 = .025 (see page 173). SPSS Output 7.14 shows the test statistics from doing Wilcoxon tests on the two comparisons that I suggested. Remember that we are now using a critical value of .025, so we compare the significance of both test statistics against this critical value. The test comparing Eastenders to the National Geographic programme about whales has a significance value of .005, which is well below our criterion of .025; therefore, we can conclude that Eastenders led to significantly more arguments than the programme about whales. The second comparison compares the number of arguments after Friends with the number after the programme about whales. This contrast is non-significant (the significance of the test statistic is .530, which is bigger than our critical value of .025), so we can conclude that there was no difference in the number of arguments after watching Friends compared to after watching the whales. The effect we got seems to mainly reflect the fact that Eastenders makes people argue more.
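The follow-up procedure can be sketched as a small helper that runs a Wilcoxon signed-rank test of each condition against the control and judges each at the Bonferroni-corrected criterion. The data below are fabricated for illustration; the function name is my own, not from the chapter.

```python
# Bonferroni-corrected Wilcoxon follow-up tests (hypothetical sketch).
import numpy as np
from scipy.stats import wilcoxon

def follow_up_wilcoxon(conditions, control, alpha=0.05):
    """Test each condition against the control; judge each p-value at the
    Bonferroni-corrected criterion alpha / (number of comparisons)."""
    crit = alpha / len(conditions)
    results = {}
    for name, scores in conditions.items():
        stat, p = wilcoxon(scores, control)
        results[name] = {"T": stat, "p": p, "significant": p < crit}
    return results

rng = np.random.default_rng(3)
whales = rng.poisson(7, size=54)  # fabricated control scores
results = follow_up_wilcoxon(
    {"Eastenders": rng.poisson(9, size=54),
     "Friends": rng.poisson(6, size=54)},
    control=whales,
)
```

With two comparisons the criterion becomes .05/2 = .025, matching the text.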
SPSS Output 7.14
Calculating an Effect-Size
As I mentioned before, there isn’t an easy way to convert a chi-square statistic that has more than 1 degree of freedom to an effect size r. As with the Kruskal-Wallis test you could use the significance value of the chi-square test statistic to find an associated value of z from a table of probability values for the normal distribution and then use the conversion to r on page 238. Alternatively, we could just calculate effect sizes for the Wilcoxon tests that we used to follow up the main analysis. These effect sizes will be very informative in their own right.
For the first comparison (Eastenders vs. control) SPSS Output 7.14 shows us that z is –2.813, and because this was based on comparing two groups each containing 54 observations, we had 108 observations in total (remember it isn’t important that the observations come from the same people). The effect size is, therefore:

r = z/√N = –2.813/√108 = –.27
This represents a medium effect (it is close to Cohen’s benchmark of .3), which tells us that the effect of Eastenders relative to the control was a substantive effect: Eastenders produced substantially more arguments.
For the second comparison (Friends vs. control) SPSS Output 7.14 shows us that z is –.629, and this was again based on 108 observations. The effect size is, therefore:

r = z/√N = –.629/√108 = –.06
This represents virtually no effect (it is close to zero). Therefore, Friends had very little effect in creating arguments compared to the control.
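The z-to-r conversion used in both calculations is simply r = z/√N, where N is the total number of observations (2 conditions × 54 couples = 108). A one-line sketch:

```python
# Convert a z statistic to the effect size r, as described in the text.
import math

def effect_size_r(z, n_obs):
    """r = z / sqrt(N), N = total number of observations."""
    return z / math.sqrt(n_obs)

r_eastenders = effect_size_r(-2.813, 108)  # about -.27, a medium effect
r_friends = effect_size_r(-0.629, 108)     # about -.06, virtually no effect
```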
Writing and Interpreting the Results
For Friedman’s ANOVA we need only report the test statistic (which we saw earlier is denoted by χ2),8 its degrees of freedom and its significance. So, we could report something like:
The number of arguments that couples had was significantly affected by the programme they had just watched, χ2(2) = 7.59, p < .05.
We need to report the follow-up tests as well (including their effect sizes):
The number of arguments that couples had was significantly affected by the programme they had just watched, χ2(2) = 7.59, p < .05. Wilcoxon tests were used to follow-up this finding. A Bonferroni correction was applied and so all effects are reported at a .025 level of significance. It appeared that watching Eastenders significantly affected the number of arguments compared to the programme about whales, T = 330.50, r = –.27. However, the number of arguments was not significantly different after Friends compared to after the programme about whales, T = 462, ns, r = –.06. We can conclude that watching Eastenders did produce significantly more arguments compared to watching a programme about whales, and this effect was medium in size. However, Friends didn’t produce any substantial reduction in the number of arguments relative to the control programme.
7.6 Summary
* * *
In the last chapter, we saw that data do not always conform to the conditions necessary to conduct parametric statistics (life is cruel like this sometimes!). This chapter has taught us about some of the tests we can do when the assumptions of parametric tests have not been met. We started by having a brief look at how we can use ranked data instead of the actual scores. We then moved on to look at simple situations in which there are only two experimental conditions (the Mann-Whitney and Wilcoxon tests) before moving on to more complex situations in which there are several experimental conditions (the Kruskal-Wallis test and Friedman’s ANOVA). We looked at each test by using an example for which we examined the SPSS output and looked at how we could interpret the results and report our findings and conclusions. The next chapter will summarize all of the tests we’ve encountered and look at how we can decide whether we (or others) have used the appropriate test.
7.7 Practical Tasks
For each example in this chapter:
Write a sample results section (using the summaries at the end of each example as a guide) but as you do so identify where all of the numbers come from and what they mean.
For each example in this chapter use Field (2000) or notes on my web page (http://www.cogs.susx.ac.uk/users/andyf/teaching):
Analyse the data and see if you can get the same outputs as I have (the SPSS data files can be found on my website).
7.8 Further Reading
Field, A. P. (2000). Discovering statistics using SPSS for Windows: advanced techniques for the beginner. London: Sage. Chapter 2 talks about non-parametric statistics.
Howell, D. C. (2001). Statistical methods for psychology (5th edition). Belmont, CA: Duxbury. Chapter 18 gives the theory behind all of the tests covered in this chapter.
Siegel, S. & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd edition). New York: McGraw-Hill. This is still the definitive reference for non-parametric statistics but is quite technical.
Notes
1 You’ll sometimes see non-parametric tests referred to as distribution-free tests, with an explanation that they make no assumptions about the distribution of the data. This is actually incorrect: they do make assumptions, but they are less restrictive ones than their parametric counterparts.
2 Incidentally, this isn’t an experiment in the pure sense because the groups are not created randomly (despite the advances of science we have yet to be able to assign random chunks of DNA to experimental conditions and then grow them into dogs or men!).
3 By this I mean data that do not meet the assumptions of a parametric test.
4 These are fictitious data and there is no evidence whatsoever that Britney makes you sacrifice goats.
5 Unfortunately, the first time they attempted the study the clown accidentally burst one of the balloons. The noise frightened the children and they associated that fear response with the clown. All 15 children are currently in therapy for coulrophobia!
6 Another disclaimer: McDonald’s are great really, and Ronald is such a nice friendly chap – honest!
7 This is just learning through observing others (Rachman, 1977).
8 The test statistic is often denoted as χ2F, but the official APA style guide doesn’t recognize this term.
8 Choosing a Statistical Test
* * *
One of the most bewildering aspects of experimental design for students is the issue of how to choose an appropriate statistical test. There are two ways in which this problem crops up. First, in statistics tests and exams, you might be presented with a hypothetical study and be required to select the statistical test that would be most suitable for analysing the data obtained from it. At first sight, this might seem rather an artificial situation. However, in real life (whatever that is!), researchers have to decide for themselves whether other researchers’ statistics (as described in journal articles and conference papers) were appropriate. So this task is actually not quite as bizarre as it might at first appear. Second, in your course, you may be faced with the problem of designing a study from scratch, and having to decide what to measure and how to analyse the data obtained. This has more obvious parallels to what researchers do.
8.1 The Need to Think About Statistics at the Outset of Designing a Study
* * *
In Chapter 2 we saw that researchers make decisions about which statistics they intend to use while they are designing their study – they never treat this as something to be thought about once the data are collected. Thinking about which inferential and descriptive statistics you are going to use should be an integral part of designing a study. If we are designing an experiment, our first thought is ‘what will this experiment tell us about the phenomenon in question?’, but this is followed almost immediately by ‘how will we measure it – what kind of data will we collect?’ (see Chapter 2) and ‘which statistical tests will we use?’ (see Chapters 5, 6 and 7).
Sometimes students are tempted to run their study first, and then worry later about which test they will use on the data once they’ve got them. However, this is bad practice. If you defer thinking about statistics until after you have done your study, you run the risk of obtaining data that are totally unanalysable. All too frequently, Andy and I have had to explain to students that the reams of data they have so laboriously collected are effectively worthless, because there is no statistical test that can be used on them. This is very depressing for us, and I expect the student gets upset too. (I fell into this trap myself, when I started the research that led to my Ph.D. In my enthusiasm to get started, I spent many months collecting data but never thought beforehand what I would actually do with it! I then spent many more months in visits to one of the best statisticians in the country, who would look at my data, sigh, and then rack his brains trying to think of some elaborate statistical treatments that might give me valid results. My life would have been so much easier had I given some thought to statistics at the outset, because only minor changes to my experimental procedures would have been needed to have made the statistics quite straightforward.)
At best, by designing a study without taking into account how you are going to analyse the data, you may end up with data that are not as informative as they might have been. Obtaining participants and running them in a study is tedious and hard work, and it wastes everyone’s time and effort if you run a study that produces useless data. Therefore, it pays to spend a little time thinking about what kind of data you intend to obtain, and how you would analyse it.
Here is a demonstration of how minor differences in how you design the study can have a big effect on what kind of data you obtain, and ultimately, on what conclusions you would be able to draw. Suppose you were interested in whether a new striping pattern improved the detectability of emergency vehicles, compared to the patterning that was already in use. How could you test this? One way might be to get a hundred participants and ask each one whether the new pattern was more detectable than the old one; less detectable; or no different in detectability. If you did this, you would end up with data that consisted of three frequencies – the number of people saying ‘the new pattern is an improvement’, the number of people saying ‘the new pattern is worse’, and the number who thought the new pattern was no different to the old one. All you could do with these three frequencies is run a Chi-Squared test (see Box 8.1), to see if they differed from the equal frequencies for the three categories that you would expect to get if people’s responses were random. If this test was statistically significant, it would tell you that people’s responses were, indeed, non-random – that responses were not equally distributed amongst the three categories. Suppose most people had said that the new striping pattern was an improvement. This would be informative, but it wouldn’t really tell you very much. It would tell you nothing about whether the new pattern was a big improvement or just a small improvement over the old one. How homogeneous is the category of people who thought the new pattern was an ‘improvement’? There are all kinds of possibilities. Everyone might have been very similar in their assessment of the new pattern. Alternatively, responses might have been sharply divided, with some participants considering the new pattern to be a huge improvement, while others thought it was only a slight improvement. We have absolutely no way of telling, from the data that we have collected, which of these possibilities has occurred. We would have gone to all of the trouble of obtaining the emergency vehicles, the stripes and the participants, merely to obtain three crude measures (the number of people falling into the three categories of opinion).
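The only analysis available for such data is a chi-squared goodness-of-fit test of the three observed frequencies against the equal frequencies expected under random responding. A sketch, with invented counts:

```python
# Chi-squared goodness-of-fit on three response frequencies; the counts
# are fabricated for illustration, not from any real detectability study.
from scipy.stats import chisquare

observed = [62, 18, 20]  # 'improvement', 'worse', 'no different' (invented)
chi2, p = chisquare(observed)  # expected frequencies default to equal
```

A significant result tells you only that responses are non-random across the three categories, which is exactly the limitation the text describes.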