How to Design and Report Experiments


by Andy Field


  The final assumption for parametric tests is that our data have come from a population that has a normal distribution. The easiest way to check this assumption is to draw some histograms of your data to see the shape of the distribution (see Field, 2000, pp. 37–46). There are several problems with just looking at histograms, though. The first is that they tell us only about the distribution of the sample and not about the distribution of the population from which the sample came (although if the sample is large enough it will be a close enough approximation). A related point is that the distributions of small data sets (N < 30) will be messy (by which I mean you can’t get a good idea of the shape of a distribution from fewer than 30 observations). The final problem is that looking at a distribution doesn’t tell us whether the distribution is different enough from normal to be a problem – it is a very subjective approach. Instead, we usually use objective tests of the distribution, and the two tests that SPSS provides are the Kolmogorov-Smirnov (or as one of my students recently called it ‘the vodka test’) and Shapiro-Wilk tests. These tests compare the set of scores in the sample to a normally distributed set of scores with the same mean and standard deviation. If the test is non-significant (p > .05) it tells us that the distribution of the sample is not significantly different from a normal distribution (i.e. it is probably normal). If, however, the test is significant (p < .05) then the distribution in question is significantly different from a normal distribution (i.e. it is non-normal). These tests are very handy because in one easy procedure they tell us whether our scores are normally distributed; however, we again need to be wary because the power of these tests depends upon the sample sizes we’ve used (see page 154).
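  If you’re working outside SPSS, the sketch below shows one way to run both tests in Python using scipy (the scores are randomly generated stand-ins, not real data). One caveat: scipy’s plain K-S test doesn’t apply the Lilliefors correction that SPSS uses when the mean and standard deviation are estimated from the sample, so don’t expect the p-values to match SPSS exactly.

```python
# A minimal sketch of the two normality tests, using Python's scipy
# rather than SPSS. The 'scores' array is invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores = rng.normal(loc=50, scale=10, size=200)  # stand-in data

# Kolmogorov-Smirnov test against a normal distribution with the
# sample's own mean and SD. (SPSS applies the Lilliefors correction
# when parameters are estimated from the data; plain scipy does not.)
D, p_ks = stats.kstest(scores, 'norm',
                       args=(scores.mean(), scores.std(ddof=1)))

# Shapiro-Wilk test
W, p_sw = stats.shapiro(scores)

print(f"K-S: D = {D:.3f}, p = {p_ks:.3f}")
print(f"Shapiro-Wilk: W = {W:.3f}, p = {p_sw:.3f}")
# p > .05: no significant deviation from normality (probably normal)
# p < .05: significant deviation from normality (non-normal)
```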

  Let’s imagine we asked readers of this book to rate how bored they were by the time they reached this chapter. They just marked on a scale from 0 (incredibly interested) to 100 (I’m so bored I’m going to pull my own teeth out of my head). If we wanted to look at the distribution of these scores we could simply plot a histogram like the one in Figure 6.1. This histogram is based on a lot of data (10,000 cases, actually) and so it has a very clear shape. However, it does look quite skewed towards the top end of the scale (see page 113). This tells us that most scores were at the higher end of the scale (i.e. most people responded that they were so bored that they wanted to pull their own teeth out of their heads!). There are very few scores at the lower end of the scale (no-one was actually interested by the time they reached this chapter). However, this histogram alone gives us only a rough idea of the distribution; it doesn’t provide an objective measure of whether the distribution is non-normal.

  Figure 6.1 Histogram of readers’ boredom ratings

  Field (2000, pp. 46–49) shows how to do the Kolmogorov-Smirnov (K-S from now on) test in SPSS. SPSS produces a table like that in SPSS Output 6.1, which includes the test statistic itself, the degrees of freedom (which should equal the sample size) and the significance value of this test. Remember that if the value in the column labelled Sig. is less than .05 then the distribution deviates significantly from normality. For the boredom scores the test is highly significant (SPSS reports p = .000, which really means p < .001), indicating that the distribution is not normal. This result reflects the strong negative skew in the data. The test statistic in this test is denoted by D and so we could report that our data significantly deviated from normality (D(10000) = .074, p < .001).

  SPSS Output 6.1

  Figure 6.2 Normal Q-Q plots of boredom scores

  SPSS also produces a normal Q-Q plot for any variables specified (see Figure 6.2). This chart plots the values you would expect to get if the distribution were normal (expected values) against the values actually seen in the data set (observed values). The expected values are shown as a straight diagonal line, whereas the observed values are plotted as individual points. Normally distributed data are shown by the observed values (the dots on the chart) falling exactly along the straight line (so the line is basically obscured by a load of dots). This would mean that the observed values are the same as you would expect to get from a normally distributed data set. If the dots deviate from the straight line then this represents a deviation from normality. In our boredom scores the dots form an arc that deviates away from the straight line at both ends. A deviation from normality such as this tells us that we cannot use a parametric test, because the assumption of normality is not tenable. In these circumstances we must turn to non-parametric tests as a means of testing the hypothesis of interest (see Chapter 7).
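  If you’d like to draw an equivalent chart without SPSS, here’s a minimal sketch using scipy and matplotlib (again with invented, negatively skewed data):

```python
# Sketch: a normal Q-Q plot like SPSS's, using scipy and matplotlib.
# The data are invented; with skewed data the dots arc away from the line.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(42)
scores = 100 - rng.gamma(shape=2, scale=8, size=1000)  # negatively skewed

fig, ax = plt.subplots()
stats.probplot(scores, dist="norm", plot=ax)  # dots = observed values,
ax.set_title("Normal Q-Q plot")               # line = expected (normal)
plt.show()
# Normally distributed data: the dots fall along the straight line.
# Skewed data like these: the dots form an arc, deviating at the ends.
```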

  6.2 The t-Test


  The t-test is used in the simplest experimental situation; that is, when there are only two groups to be compared. The test statistic produced by the test is, unsurprisingly, called t and it is the difference between means (i.e. the experimental effect) divided by an estimate of the standard error of the difference between those two sample means (see Field, 2000, Chapter 6). If you think back to what the standard error represents (page 132) you might remember that it was a measure of how well a sample represents the population. We can extend the standard error to situations in which we’re looking at the differences between sample means. We saw earlier in this chapter that if you took several pairs of samples from a population and calculated the difference between their means, then most of the time this difference would be close to zero. However, sometimes one or both of the samples could have a mean very deviant from the population mean and so it is possible to obtain large differences between sample means by chance alone. If you plotted these differences as a histogram, you would again have a sampling distribution with all of the properties previously described. The standard deviation of this sampling distribution would be the standard error of differences. A small standard error tells us that most pairs of samples from a population will have very similar means (i.e. the difference between sample means should normally be very small). A large standard error tells us that sample means can deviate quite a lot from the population mean and so differences between pairs of samples can be quite large by chance alone. The standard error of differences, therefore, gives us an estimate of the extent to which we’d expect sample means to be different by chance alone – it is a measure of the unsystematic variance, or variance not caused by the experiment. As such, the t-test is simply the difference between means as a function of the degree to which those means would differ by chance alone.
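  If that sounds abstract, a small simulation makes it concrete. The sketch below (Python, with an invented population) draws thousands of pairs of samples from the same population and shows that the standard deviation of the resulting mean differences is the standard error of differences described above:

```python
# Sketch: building the sampling distribution of differences by brute
# force. Population mean/SD and sample size are invented for the demo.
import numpy as np

rng = np.random.default_rng(1)
pop_mean, pop_sd, n = 50, 10, 10

diffs = []
for _ in range(10_000):
    a = rng.normal(pop_mean, pop_sd, n)  # two samples from the SAME
    b = rng.normal(pop_mean, pop_sd, n)  # population (true difference = 0)
    diffs.append(a.mean() - b.mean())
diffs = np.array(diffs)

print(f"mean of differences: {diffs.mean():.3f}")  # close to zero
print(f"SD of differences:   {diffs.std():.3f}")   # the standard error
# Theoretical standard error of differences: sqrt(sd**2/n + sd**2/n)
print(f"theoretical value:   {np.sqrt(2 * pop_sd**2 / n):.3f}")
```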

  6.3 The Independent t-Test

  You would plan to use the independent t-test if you are going to do an experiment using only two groups (you’re comparing two means), and different participants will be used in each group (so each person will contribute only one score to the data). The word independent in the name tells us that different participants were used.

  Example: One of my pet hates is ‘pop psychology’ books. Along with banishing Freud from all bookshops, it is my avowed ambition to rid the world of these rancid putrefaction-ridden wastes of trees. Not only do they give psychology a very bad name by stating the bloody obvious and charging people for the privilege, but they are also considerably less enjoyable to look at than the trees killed to produce them (admittedly the same could be said for the turgid tripe that I produce in the name of education, but let’s not go there just for now!). Anyway, as part of my plan to rid the world of popular psychology I did a little experiment. I took two groups of people who were in relationships and randomly assigned them to one of two conditions. One group read the famous popular psychology book ‘Women are from Bras, men are from Penis’, whereas the other group read ‘Marie Claire’. I tested only 10 people in each of these groups, and the dependent variable was an objective measure of their happiness with their relationship after reading the book. I didn’t make any specific prediction about which reading material would improve relationship happiness.

  SPSS Output for the Independent t-Test

  SPSS Output 6.2 shows the output from an independent t-test; note that there are two tables. The first table summarizes the data for the two experimental conditions. From this table, we can see that both groups had 10 participants (column labelled N). The group who read ‘Women are from Bras, men are from Penis’ had a mean relationship happiness score of 20, with a standard deviation of 4.11. What’s more, the standard error of that group (the standard deviation of the sampling distribution) was 1.30 when rounded off. In addition, the table tells us that the average relationship happiness level in participants who read ‘Marie Claire’ was 24.2, with a standard deviation of 4.71, and a standard error of 1.49.

  The second table of output contains the main test statistics. There are actually two rows: one labelled Equal variances assumed, and the other labelled Equal variances not assumed. Earlier I said that parametric tests assume that the variances in experimental conditions are roughly equal and if the variances were not equal then we couldn’t use parametric tests; well, in reality adjustments can be made to the test statistic to make it accurate even when the group variances are different. The row of the table we use depends on whether the assumption of homogeneity of variances has been broken. To tell whether this assumption has been broken we could just look at the values of the variances and see whether they are similar (in this example the standard deviations of the two groups are 4.11 and 4.71 and if we square these values then we get the variances). However, this would be very subjective and so there is a test we use to see whether the variances are different enough to cause concern. Levene’s test (as it is known) resembles a t-test in that it tests the hypothesis that the variances in the two groups are equal (i.e. the difference between the variances is zero). So, if Levene’s test is significant at p < .05 then we conclude that the null hypothesis is incorrect and that the variances are significantly different – the assumption of homogeneity of variances has been violated. Conversely, if Levene’s test is non-significant (i.e. p > .05) then we accept the null hypothesis that the difference between the variances is roughly zero – the variances are more or less equal. For these data, Levene’s test is non-significant (because p = .492, which is greater than .05) and so we should read the test statistics in the row labelled Equal variances assumed. Had the probability of Levene’s test been less than .05, then we would have read the test statistics from the row labelled Equal variances not assumed.
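  For readers without SPSS, the same decision process can be sketched in Python with scipy. The two groups below are randomly generated stand-ins, not the actual study scores; the point is only to show Levene’s test choosing between the pooled and Welch forms of the t-test:

```python
# Sketch: Levene's test deciding which t-test row to trust.
# The two groups are random stand-ins, NOT the actual study data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group1 = rng.normal(20.0, 4.11, 10)  # 'Women are from Bras...' stand-in
group2 = rng.normal(24.2, 4.71, 10)  # 'Marie Claire' stand-in

W, p_lev = stats.levene(group1, group2)
print(f"Levene's test: W = {W:.3f}, p = {p_lev:.3f}")

# Non-significant Levene -> 'Equal variances assumed' (pooled t-test);
# significant Levene -> 'Equal variances not assumed' (Welch's t-test).
t, p = stats.ttest_ind(group1, group2, equal_var=(p_lev > .05))
print(f"t-test: t = {t:.3f}, p = {p:.3f}")
```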

  SPSS Output 6.2

  Having established that we have homogeneity of variances, we can move on to the more interesting business of the t-test itself. We are told the mean difference (20 – 24.2 = –4.2) and the standard error of the sampling distribution of differences (1.98). The t-statistic is calculated by dividing the mean difference by this standard error (t = –4.2/1.98 = –2.12). This value is the test statistic referred to on page 146. The value of t is then assessed against the value of t you might expect to get by chance when you have certain degrees of freedom. For the t-test, degrees of freedom are calculated by adding the two sample sizes and then subtracting the number of samples (df = N1 + N2 – 2 = 10 + 10 – 2 = 18). SPSS produces the exact significance value of t, and we are interested in whether this value is less than or greater than .05. In this case the two-tailed value of p is .048, which is just smaller than .05, and so we could conclude that there was a significant difference between the means of these two samples. By looking at the means, we could infer that relationship happiness was significantly higher after reading ‘Marie Claire’ than after reading ‘Women are from Bras, men are from Penis’.
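  If you want to check SPSS’s arithmetic yourself, this sketch reproduces the t-value by hand from the summary statistics reported above (means, standard deviations and group sizes from SPSS Output 6.2):

```python
# Sketch: computing the independent t-test from summary statistics.
# Values are the ones reported in the text (SPSS Output 6.2).
import math

m1, sd1, n1 = 20.0, 4.11, 10   # 'Women are from Bras, men are from Penis'
m2, sd2, n2 = 24.2, 4.71, 10   # 'Marie Claire'

# Pooled variance (the 'equal variances assumed' route)
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se_diff = math.sqrt(sp2 * (1 / n1 + 1 / n2))   # ~1.98

t = (m1 - m2) / se_diff                        # ~ -2.12
df = n1 + n2 - 2                               # 18
print(f"t({df}) = {t:.2f}, SE of difference = {se_diff:.2f}")
```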

  The final thing that this output provides is a 95% confidence interval for the mean difference. Imagine we took 100 pairs of samples from a population and calculated the difference between the two sample means for each pair. We would end up with 100 mean differences. The confidence interval tells us the boundaries within which 95 of these 100 mean differences would lie. So, 95% of mean differences will lie between –8.35 and –0.05. If this interval doesn’t contain zero then it tells us that in 95% of samples the mean difference will not be zero. This is important because if we were to compare pairs of random samples from a population we would expect most of the differences between sample means to be around zero (in reality some will be slightly above zero and others will be slightly below). If 95% of samples from our population all fall on the same side of zero (are all negative or all positive) then we can be confident that our two samples do not represent random samples from the same population. Instead they represent samples from different populations, a difference induced by the experimental manipulation – in this case, the book that they read.
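  The interval itself is just the mean difference plus or minus the critical value of t (at 18 degrees of freedom) multiplied by the standard error of the difference. A quick sketch, reusing the values from the previous one:

```python
# Sketch: the 95% confidence interval for the mean difference,
# built from the values computed in the previous sketch.
from scipy import stats

mean_diff = -4.2
se_diff = 1.977     # standard error of the difference (before rounding)
df = 18

t_crit = stats.t.ppf(0.975, df)   # two-tailed 5% critical value (~2.10)
lower = mean_diff - t_crit * se_diff
upper = mean_diff + t_crit * se_diff
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")   # ~ [-8.35, -0.05]
```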

  Calculating the Effect Size for the Independent t-Test

  Even though our t-statistic is statistically significant, this tells us nothing about whether the effect is substantive (i.e. important in practical terms): as we’ve seen, highly significant effects can in fact be very small. To discover whether the effect is substantive we need to turn to what we know about effect sizes. It therefore makes sense to tell you how to convert your t-statistic into a standard effect size. As I mentioned, I’m going to stick with the effect size r because it’s widely understood and frequently used. Converting a t-value into an r-value is actually really easy; we can use the following equation (from Rosenthal, 1991, p. 19):

  r = √(t² / (t² + df))

  We know the value of t and the df from the SPSS output and so we can compute r as follows:

  r = √((–2.12)² / ((–2.12)² + 18)) = √(4.49 / 22.49) = .45

  If you think back to our benchmarks for effect sizes this represents a fairly large effect (it is just below .5 – the threshold for a large effect). Therefore, as well as being statistically significant, this effect is large and so represents a substantive finding.
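  As a quick check on the arithmetic, the conversion is a one-liner:

```python
# Sketch: Rosenthal's (1991) t-to-r conversion for our result.
import math

t, df = -2.12, 18
r = math.sqrt(t**2 / (t**2 + df))
print(f"r = {r:.2f}")   # .45: just below the .5 benchmark for 'large'
```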

  Interpreting and Writing the Result for the Independent t-Test

  When you report any statistical test you usually state the finding to which the test relates, then report the test statistic (usually with its degrees of freedom) and the probability value of that test statistic; more recently the American Psychological Association are, quite rightly, requesting an estimate of the effect size as well. To get you into good habits early, we’ll start thinking about effect sizes now – before you get too fixated on Fisher’s magic .05. In this example we know that the value of t was –2.12, that the degrees of freedom on which this was based were 18, and that it was significant at p = .048. This can all be obtained from the SPSS output. We can also see the means for each group. Based on what we learnt about reporting means on page 136, we could now write something like:

  On average, the reported relationship happiness after reading ‘Marie Claire’ (M = 24.20, SE = 1.49) was significantly higher than after reading ‘Women are from Bras, men are from Penis’ (M = 20.00, SE = 1.30), t(18) = –2.12, p < .05, r = .45.

  Note how we’ve reported the means in each group (and standard errors) in the standard format. For the test statistic note that we’ve used an italic t to denote the fact that we’ve calculated a t-statistic, then in brackets we’ve put the degrees of freedom, and then stated the value of the test statistic. The probability can be expressed in several ways (see Chapter 11): often people report things to a standard level of significance (such as .05), as I have done here. Other times people will report the exact significance. Finally, note that I’ve reported the effect size at the end – you won’t see this that often in published papers, but times are changing and you should soon see effect sizes being reported as standard practice. Equally valid would be to draw a graph of our group means (including error bars – see Chapter 4) and then to report:

  The reported relationship happiness after reading ‘Marie Claire’ was significantly higher than after reading ‘Women are from Bras, men are from Penis’, t(18) = –2.12, p = .048, r = .45.

  Note how I’ve now reported the exact significance. I could also state early on my criterion for the probability of a Type I error, and then not report anything relating to the probability:

  All effects will be reported at a .05 level of significance. The reported relationship happiness after reading ‘Marie Claire’ was significantly higher than after reading ‘Women are from Bras, men are from Penis’, t(18) = –2.12, r = .45. The effect size estimate indicates that the difference in relationship happiness created by the reading material represents a large, and therefore substantive, effect.

  Try to avoid writing vague, unsubstantiated things like this:

  People were happier after reading ‘Marie Claire’, t = –2.12.

  Happier than what? Where are the df? Was the result statistically significant? Was the effect important (what was the effect size)?

  6.4 The Dependent t-Test


  The dependent, or matched-pairs, t-test is used in the same situation as the previous test, except that it is designed for situations in which the same participants have been used in both experimental conditions. So, you’d plan to use it when you will have two experimental conditions (you will compare two means), and the same participants will be used in both conditions (so each person will contribute two scores to the data). The word dependent in the name tells us that the same participants have been used.
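  For completeness, here’s a minimal sketch of how a paired design like this would be analysed outside SPSS (Python’s scipy, with invented before-and-after scores):

```python
# Sketch: a dependent (paired) t-test, where each participant
# contributes two scores. The data below are invented.
from scipy import stats

condition1 = [18, 22, 19, 24, 21, 17, 23, 20, 19, 22]
condition2 = [21, 25, 20, 26, 22, 19, 24, 23, 21, 24]

t, p = stats.ttest_rel(condition1, condition2)
df = len(condition1) - 1   # paired design: df = number of pairs - 1
print(f"t({df}) = {t:.2f}, p = {p:.3f}")
```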

 
