How to Design and Report Experiments

by Andy Field


  Now, what happens if we do something to one of our samples? Well, if we took one sample of lecturers and removed their brains then we would expect the mean in this sample to change also – we might predict that their lecturing got worse (although on thinking about it, it would probably get better!). So, we’d expect the sample means to differ: one of them will be lower because we removed the brains of the people within it. This sample no longer represents the population of lecturers: it represents the population of lecturers that have had their brains removed – a different population! When we do an experiment like this and we find that our samples have very different means then there are two possible explanations:

  Figure 5.1 A basic experiment to test whether the removal of a lecturer’s brain affects their lecturing ability

  The manipulation that was carried out on the participants (i.e. removal of their brain) has changed the thing we’re measuring (in this case lecturing skills). The implication is that our samples have come from two different populations, or

  The samples come from the same population but have different means because we have inadvertently selected samples that are very different from each other (because of the characteristics of the people within them). The implication here is that the experiment has had no effect and that the observed difference between the samples is just a fluke.

  When sample means are very different then it is very unlikely that they have come from the same population (because we know from sampling distributions that the majority of samples will have quite similar means). So, in simple terms, the bigger the difference between our sample means, the more unlikely it is that these samples came from the same population – and the more confident we can be that our experiment has had an effect. When sample means are fairly similar it becomes more probable that they could have come from the same population – so we become less confident that our experiment has had an effect.

  These ideas are the basis of hypothesis testing. Simplistically, we can calculate the probability that two samples come from the same population; when this probability is high then we conclude that our experiment has had no effect (the null hypothesis is true), but when this probability is very small, we conclude that the experiment has had an effect (the experimental hypothesis is true). How do we decide what probability is small enough to accept that our experimental hypothesis is true? Well, again, this is where we use Fisher’s criterion of 5% (which, expressed as a probability value, is .05). If the probability of the two samples being from the same population is less than or equal to .05 (i.e. 1 in 20 or less) then we reject the null hypothesis and accept that our experimental manipulation has been successful.
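  To make the decision rule concrete, here is a minimal sketch in Python; the p-value is assumed to have been calculated already by whatever test is appropriate for your design:

    ALPHA = 0.05  # Fisher's criterion, expressed as a probability

    def decide(p_value):
        """Apply the .05 criterion to a p-value from some test."""
        if p_value <= ALPHA:
            return "reject the null hypothesis: the manipulation seems to have had an effect"
        return "retain the null hypothesis: no convincing evidence of an effect"

    print(decide(0.03))   # significant at the 5% level
    print(decide(0.40))   # not significant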

  How Do We Calculate the Probability that our Samples are from the Same Population?

  Actually, this is not a straightforward question to answer because the exact method depends upon your experimental design and the test you’re using. However, it is possible to outline some general principles. We’ve just seen that in almost any experiment there are two basic sources of variation (see also Chapter 3):

  Systematic variation: This variation is due to the experimenter doing something to all of the participants in one sample but not in the other sample or samples.

  Unsystematic variation: This is variation due to natural differences between people in different samples (such as differences in intelligence or motivation).

  In essence, whenever we’re trying to find differences between samples we have to calculate a test statistic. A test statistic is simply a statistic that has known properties; specifically, we know its frequency distribution. By knowing this, we can work out the probability of obtaining a particular value. A good analogy to this is the age at which people die. We know the distribution of the age of death from years of past experience and data. So, we know that on average men die at about 75 years old; we also know that this distribution is fairly top-heavy; that is, most people die above the age of about 50 and it’s fairly unusual to die in your 20s (he says with a sigh of relief!). So, the frequencies of the age of demise at older ages are very high but are lower at younger ages. From these data, it would be possible to calculate the probability of someone dying at a certain age. If we randomly picked someone and asked them their age, and it was 57, we could tell them how likely it is that they will die before their next birthday (this is a great way to become very unpopular with older relatives!). Also, if we met a man of 112, we could calculate how probable it was that he would have lived that long (this probability would be very small – most people die before they reach that age). The way we use test statistics is rather similar: we know their distributions and this allows us, once we’ve calculated the test statistic, to discover the probability of having found a value as big as we have. So, if we calculated a test statistic and its value was 112 (rather like our old man) we can then calculate the probability of obtaining a value that large.
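  The same logic can be sketched in a few lines of Python. The mortality figures here are invented purely for illustration (a normal distribution with a mean of 75 and a standard deviation of 10 is assumed, not real actuarial data):

    from scipy import stats

    # Assumed (made-up) distribution of age at death
    age_at_death = stats.norm(loc=75, scale=10)

    # Probability of surviving to 112 or beyond (tiny, like our old man)
    print(age_at_death.sf(112))

    # Probability of surviving beyond 57 (much larger)
    print(age_at_death.sf(57))

    # Test statistics work the same way: because we know their distribution,
    # we can ask how probable a value at least as big as the one we found is.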

  So, how do we calculate these test statistics? Again, this will depend on the design of the experiment and the test you’re using; however, when comparing the means of different samples, the test statistics typically represent the same thing:
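  Written out in words (rather than the symbols a particular test would use):

    test statistic = variance created by the experimental manipulation / variance due to random (unsystematic) factors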

  The exact form of this equation changes from test to test, but essentially we’re always comparing the amount of variance created by an experimental effect against the amount of variance due to random factors (see Box 5.1 for a more detailed explanation). The reason why this ratio is so useful is intuitive really: if our experiment has had an effect then we’d expect it to create more variance than random factors alone. In this case, the test statistic will always be greater than 1 (but not necessarily significant). If the experiment has created as much variance as random factors then the test statistic will be 1 exactly, and it means that our manipulation hasn’t created any more variance than if we’d done nothing (i.e. the experimental manipulation hasn’t done much). If the random factors have created more variance than the experimental manipulation then this means that the manipulation has done less than if we hadn’t done anything at all (and the test statistic will be less than 1).

  Box 5.1: Sources of variance – a piece of cake!

  Imagine we were interested in whose music was scarier: Slipknot or Marilyn Manson. We could play a track from each band to different people and measure their heart rate (which should indicate arousal). Now the heart rate scores we get will vary: different people will have different heart rates. If we want to find out by how much people’s scores vary we could calculate the sum of squared errors between all scores and the mean of all scores (see the section beginning on page 120) – this is known as the total sum of squares, SST. This is a crude measure of the total difference between heart rate scores. In Chapter 3 we also saw that there are basically two explanations for variability in scores: the experimental manipulation and individual differences. The amount of variability caused by individual differences in people can be calculated by looking at the sum of squared differences between each person’s score and the mean of the group to which they belong. This sum of squared errors is sometimes called the residual sum of squares, SSR. The effect of our experiment can also be measured by a sum of squared errors called the model sum of squares, SSM. The exact details of how to calculate these values are beyond this book, but if you’re interested I go into it in my other book (Field, 2000).

  The variability between scores is a bit like a big cake. The total sum of squares tells us how big our cake is to begin with: a large SST tells us we’re dealing with a huge five-tier wedding cake (there is a lot of variation in heart rate) whereas a small value indicates that we have a small muffin (there are very few differences in heart rate)! This cake can be cut into two. Imagine that our experiment is a person; now we’re interested in how much of this cake this person can eat. If the experiment is successful then it will eat a lot of cake, so there won’t be much left over, but if it is unsuccessful then it will only nibble a few crumbs and then leave the rest. Put another way, we hope that our experiment is a big greedy cake-loving lard-monster who’ll gobble up most of the cake; conversely it would be a disaster if our experiment turned out to be Kate Moss! If there are a lot of differences between heart rates after listening to Slipknot compared to Marilyn Manson then the model sum of squares will be big. We could think of this like Marilyn Manson and Slipknot having eaten a lot of the cake (of course I realize that both bands would probably prefer to eat goats and that sort of thing but let’s assume that after a hard day of corrupting today’s youth they like a nice piece of cake) – and there will be only a small piece of cake left over. Figure 5.2 shows how the total variation (the cake) is broken into two constituent pieces. Imagine in this example we started with 50 units of variation (so, SST = 50) and the experiment could explain 30 of these units (so, Marilyn and Slipknot ate 30 of the 50 units of cake, SSM = 30). That means there are 20 units left that can’t be explained (SSR = 20).

  Now if we want to know something about the success of our experiment, there are two ratios we can look at. The first thing to do is to compare the effect of the experiment to the effect of nothing. This ratio is usually adapted to be used as a test statistic:
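  In terms of the sums of squares defined above, the ratio is (roughly, before any adaptation):

    test statistic ∝ SSM / SSR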

  Figure 5.2 The variation between scores is like a cake: we can cut it into two slices, one representing the effect of the experiment and the other representing individual differences or random factors

  In this example the test statistic would be proportionate to (that’s what the ∝ symbol means) 30/20, or 1.5. Another useful measure is how much of the cake the experiment can account for:
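  In the same notation:

    proportion of variance explained = SSM / SST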

  This ratio is similar to the previous one except that this time we divide by the initial size of the cake (the total variation). In this example we’d get 30/50, or 0.6. This value is a proportion so we can express it in terms of a percentage by multiplying by 100 (in this case it becomes 60%). So we could say that listening to Slipknot instead of Marilyn Manson can explain 60% of the total variability in heart rates. This ratio is known as r-square, or the coefficient of determination, and if you take the square root of the value, you get r, the Pearson correlation coefficient (see Field, 2000, for more details on partitioning variance).
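  A short Python sketch shows how the three sums of squares and the two ratios fit together; the heart rate scores below are invented for illustration and are not data from any real study:

    import numpy as np

    # Made-up heart rates for two groups of listeners
    slipknot = np.array([88.0, 92.0, 95.0, 90.0, 94.0])
    manson = np.array([78.0, 80.0, 83.0, 79.0, 81.0])

    all_scores = np.concatenate([slipknot, manson])
    grand_mean = all_scores.mean()

    # Total sum of squares: how big the whole cake is
    ss_t = ((all_scores - grand_mean) ** 2).sum()

    # Residual sum of squares: differences from each group's own mean
    ss_r = ((slipknot - slipknot.mean()) ** 2).sum() + ((manson - manson.mean()) ** 2).sum()

    # Model sum of squares: the slice the experiment ate
    ss_m = ss_t - ss_r

    print(ss_m / ss_r)   # the effect-to-error ratio behind the test statistic
    print(ss_m / ss_t)   # r-square: the proportion of the cake the experiment explains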

  Once we’ve calculated a particular test statistic we can then use its frequency distribution to tell us how probable it was that we got this value. As we saw in the previous section, these probability values tell us how likely it is that our experimental effect is genuine and not just a chance result. If our experimental effect is large compared to the natural variation between samples then the systematic variation will be bigger than the unsystematic variation and the test statistic will be greater than 1. The more variation our manipulation creates compared to the natural variation, the bigger the test statistic will be. The bigger the test statistic is, the more unlikely it is to occur by chance (like our 112-year-old man). So, we find that as test statistics get bigger, the probability of them occurring becomes smaller. When this probability falls below .05 (Fisher’s criterion), we accept this as giving us enough confidence to assume that the test statistic is as large as it is because of our experimental manipulation – and not because of random factors. Put another way, we accept our experimental hypothesis and reject our null hypothesis – however, Box 5.2 explains some common misconceptions about this process.

  Two Types of Mistake

  We have seen that we use inferential statistics to tell us about the true state of the world (to a certain degree of confidence). Specifically, we’re trying to see whether our experimental manipulation has had some kind of effect. There are two possibilities in the real world: our experimental manipulation has actually made a difference to our samples, or our experimental manipulation has been completely useless and had no effect whatsoever. We have no way of knowing which of these possibilities is true. We look at test statistics and their associated probability to tell us which of the two is more likely. It is important that we’re as accurate as possible: we don’t want to make mistakes about whether our experiment has had an effect. This is why Fisher originally said that we should be very conservative and only accept a result as being genuine when we are 95% confident – or when there is only a 5% chance that the results could occur by chance. However, even if we’re 95% confident there is still a small chance that we get it wrong. In fact there are two mistakes we can make:

  Type I error: This is when we believe that our experimental manipulation has been successful, when in fact it hasn’t. This would occur when, in the real world, our experimental manipulation has no effect, yet we have got a large test statistic because we coincidentally selected two very dissimilar samples. So, the sample means were very different but not because of the experimental manipulation. If we use Fisher’s criterion then the probability of this error is .05 (or 5%) when the experiment has no effect – this value is known as the α-level. Assuming our experiment has no effect, if we replicated the experiment 100 times we could expect that on five occasions the sample means would be different enough to make us believe that the experiment had been successful, even though it hadn’t.
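  This is easy to see in a small simulation. In the sketch below (the population values are made up, and a t-test is used purely for convenience) both samples always come from the same population, yet roughly 5% of the replications still come out 'significant':

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha = 0.05
    replications = 10_000
    false_positives = 0

    for _ in range(replications):
        # Both samples come from the SAME population: the null hypothesis is true
        sample_1 = rng.normal(loc=100, scale=15, size=20)
        sample_2 = rng.normal(loc=100, scale=15, size=20)
        _, p = stats.ttest_ind(sample_1, sample_2)
        if p <= alpha:
            false_positives += 1

    print(false_positives / replications)   # hovers around .05, the alpha-level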

  Box 5.2: What we can and can’t conclude from a significant test statistic

  The importance of an effect: We’ve seen already that the basic idea behind hypothesis testing involves us generating an experimental hypothesis (the means of our experimental conditions will differ) and a null hypothesis (the means of our experimental conditions will be the same). If the probability of obtaining the value of our test statistic by chance is less than .05 then we generally accept the experimental hypothesis as true: our means are indeed different. Normally we say ‘our experiment had a significant effect’. However, don’t be fooled by that word ‘significant’, because even if the probability of our effect being a chance result is small (less than .05) it doesn’t necessarily follow that the effect is important. Very small and unimportant effects can turn out to be statistically significant just because huge numbers of people have been used in the experiment (see page 152).

  Non-significant results: Once you’ve calculated your test statistic, you calculate the probability of that test statistic occurring by chance; if this probability is greater than .05 you reject your experimental hypothesis. However, this does not mean that the null hypothesis is true. Remember that the null hypothesis is that the means in different groups are identical, and all that a non-significant result tells us is that the means are not different enough to be anything other than a chance finding. It doesn’t tell us that the means are the same; as Cohen (1990) points out, a non-significant result should never be interpreted (despite the fact it often is) as ‘no difference between means’ or ‘no relationship between variables’. Cohen also points out that the null hypothesis is never true because we know from sampling distributions (see page 132) that two random samples will have slightly different means, and even though these differences can be very small (e.g. one mean might be 10 and another might be 10.00000000000000000001) they are nevertheless different. In fact, even such a small difference would be deemed statistically significant if a big enough sample were used (see page 152). So, significance testing can never tell us that the null hypothesis is true, because it never is!
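  Cohen’s point about sample size is easy to demonstrate. In the sketch below the two population means differ by a trivial amount (all of the numbers are invented), yet with enough participants the difference still comes out as 'statistically significant':

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n = 2_000_000                                         # an absurdly large sample
    group_a = rng.normal(loc=10.000, scale=1.0, size=n)
    group_b = rng.normal(loc=10.005, scale=1.0, size=n)   # a negligible difference in means

    t, p = stats.ttest_ind(group_a, group_b)
    print(t, p)   # p will typically fall well below .05 despite the tiny difference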

  Significant results: OK, we may not be able to accept the null hypothesis as being true, but we can at least conclude that it is false when our results are significant, right? Wrong! A significant test statistic is based on probabilistic reasoning, which severely limits what we can conclude. Again, Cohen (1994), who was an incredibly lucid writer on statistics, points out that formal reasoning relies on an initial statement of fact followed by a statement about the current state of affairs, and an inferred conclusion. This syllogism illustrates what I mean:

  • If a man has no legs then he can’t play football. This man plays football.

  Therefore, this man has legs.

  The syllogism starts with a statement of fact that allows the end conclusion to be reached because you can deny the man has no legs (the antecedent) by denying that he can’t play football (the consequent). A comparable statement of the null hypothesis would be:

  • If the null hypothesis is correct, then this test statistic cannot occur. This test statistic has occurred.

  Therefore, the null hypothesis is false.

  This is all very nice except that the null hypothesis is not represented in this way because it is based on probabilities. Instead it should be stated as follows:

  • If the null hypothesis is correct, then this test statistic is highly unlikely. This test statistic has occurred.

 
