How to Design and Report Experiments
Box 1.3: When is a control condition not a control condition?
There are many years of research indicating that if you present a stimulus that evokes a neutral response alongside one that evokes an affective response (positive or negative) then after many of these pairings the neutral stimulus will come to evoke the same emotional response as the stimulus with which it was paired. This process is rather like what happens in advertising: a relatively neutral product (e.g. a car) is presented alongside a favourable stimulus (e.g. a semi-naked celebrity) in the hope that the product will become associated with the positive emotion evoked by the celebrity. This kind of learning is called evaluative learning. The reason why the neutral stimulus comes to evoke an affective response is thought to be that it becomes associated with the other stimulus (i.e. the thought of one evokes some connection in memory to the other).
In a typical experiment, several neutral stimuli are rated according to how much a person likes or dislikes the stimulus. Then, some of these neutral stimuli are presented alongside liked stimuli, some presented with disliked stimuli and some with other neutral stimuli. The neutral stimuli are then re-rated to see how much participants now like or dislike them. Typically, neutral stimuli presented alongside positive images are subsequently rated more positively, those paired with negative imagery are rated more negatively, and those paired with other neutral stimuli do not change rating (see the example graph). So, the responses evoked by the neutral stimuli change because of them being presented with stimuli that are liked (or disliked) – see De Houwer, Thomas and Baeyens (2001) for a review.
Researchers in this area had agreed that this paradigm was a suitable way of showing that presenting a neutral stimulus with an affect-evoking stimulus causes the change in response to the neutral stimulus. The proposed cause is the ‘pairing’ process (i.e. repeated presentation of a neutral stimulus with one that evokes a response). They argue that because ratings to neutral stimuli presented alongside other neutral stimuli do not change whereas those presented with liked or disliked stimuli do change, the only explanation is that subjects associate the neutral pictures with those that evoke an affective response (e.g. Baeyens, De Houwer, Vansteenwegen, & Eelen, 1998; Baeyens, Eelen & Crombez, 1995). Is this conclusion correct?
An alternative explanation
Graham Davey and I argued (Field & Davey, 1998) that the paradigm just described does not demonstrate that the pairing process is the causal factor because there is no condition in which neutral stimuli are not paired with anything. So, even though there is a control in the sense that some neutral pictures are paired with other neutral pictures, this control does not eradicate the predicted cause. How does this fit in with Mill’s ideas about isolating cause and effect? What is needed is some condition in which participants are exposed to the same pictures, the same number of times, but the neutral pictures are never presented alongside positive or negative pictures. I even devised such a control condition (Field, 1996, 1997) that allowed researchers to compare one situation in which neutral pictures are presented alongside affect-evoking pictures with one in which the neutral pictures are presented completely separately to positive and negative ones. Interestingly, I used this control condition (Field & Davey, 1997, 1999) and found identical results to those of the experimental condition. This showed that the effects that had previously been attributed to the presentation process (the pairing of neutral pictures with liked or disliked pictures) could not possibly have been caused by this factor because even when neutral pictures were not presented alongside affect-evoking pictures the results were the same. It turned out that participants were doing something completely different in the experiment that created the illusion of them learning (there was a tertium quid at work). Without the kinds of controls derived from Mill’s thinking, these confounding variables cannot be ruled out.
Some things to think about
How does this real-life example fit in with Popper’s notion of falsification? Well, there was a lot of corroborating evidence for Baeyens et al.’s interpretation of this paradigm. However, Field and Davey’s use of a new paradigm threw up evidence that contradicted Baeyens’ hypothesis. This illustrates Popper’s idea of falsification in that we took an established hypothesis and compared it with a competing one and then used the data to falsify one and corroborate the other.
Killing the tertium quid II: randomization
In the previous section, I mentioned in passing that in the mobile phone experiment we might want to make sure that all participants had an equivalent level of brain health. This comment actually hid an important point about how we allocate participants to our various groups. We can rule out many random influences on our experiment simply by randomizing parts of the study. The first important thing to do is to make sure participants are randomly allocated to your experimental and control groups. Imagine we worked in a head-injury unit at a hospital and we decided to use these people as participants in our experimental group and we then found a group of strangers for our control. The experimental group contains people that we know have head injuries so they might already have tumours or abnormalities from their injury. The other group (who have been allocated to the control) has no such history. If you gave your head-injury friends phones to use for a week and then concluded that phones caused brain abnormalities, would this conclusion be accurate? Well, of course it wouldn’t because the group that used phones already had problems. Although this example is extreme, it illustrates the point I’m trying to make: we should not allow any systematic bias into our experiment. We could use our head-injury friends only if we randomly (and preferably evenly) allocated them to both groups (that way we know that any existing head injuries are present in both groups and so should cancel out). If you think of how complex the average human is (how much humans differ in intelligence, motivation, emotional expression, physical characteristics) you should realize how important it is to randomly allocate people to experimental groups: it ensures a roughly equivalent spread of attributes across all groups.
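To make the idea of random allocation concrete, here is a minimal sketch in Python. The function name `randomly_allocate`, the participant labels and the `seed` parameter are illustrative assumptions rather than part of any study described here; the sketch simply shuffles a list of participants and splits it in half, so that no systematic rule decides who ends up in which group.

```python
import random


def randomly_allocate(participants, seed=None):
    """Shuffle the participants and split them into two groups of (near) equal size.

    `seed` is only there so that an allocation can be reproduced; leave it
    as None for a fresh random allocation each time.
    """
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    experimental_group = shuffled[:half]
    control_group = shuffled[half:]
    return experimental_group, control_group


# Hypothetical participant labels, purely for illustration
people = [f"P{i}" for i in range(1, 11)]
experimental, control = randomly_allocate(people)
print("Experimental group:", experimental)
print("Control group:", control)
```

Because chance alone decides the split, any pre-existing characteristic (such as a prior head injury) is as likely to land in one group as the other, which is what allows such characteristics to roughly balance out across groups.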
Comparing theories: statistics
Section 2 of this book talks in more detail about statistics. Suffice to say at this stage that once we have our experimental conditions that have controlled for confounding variables and have isolated causal factors, we need some objective way of comparing one condition with another. In fact, we use mathematics to help us out with this task. The basic idea behind most modern statistics can be illustrated with an example from the man who you should blame next time you’re bored witless in a statistics lecture: Ronald Fisher. Fisher’s (1925) contribution to statistics and methodology was so groundbreaking that it was republished some 66 years later. In Fisher’s (1925) book, he describes an experiment designed to test a claim by a lady that she could determine, by tasting a cup of tea, whether the milk or the tea was added first to the cup. Anyway, Fisher’s line of thinking was that he should give the lady some cups of tea, some of which had the milk added first and some of which had the milk added last and see whether she could correctly identify them (a bit like the Pepsi challenge1 for those of you old enough to have seen the adverts). In all cases discussed, the lady knows that there are an equal number of cups in which milk was added first or last. If we take the simplest situation in which there are only two cups (so, literally like the Pepsi challenge) then the lady has 50% chance of guessing correctly. If she did guess correctly would we, therefore, be confident in concluding that she can tell the difference between cups in which the milk was added first from those in which it was added last? Probably not: most of us could perform fairly well on this task just by guessing. Imagine we complicate things by having 4 cups (two with milk added first and two with milk added last). 
There are 6 orders in which these cups can be arranged, so if the lady gets the order correct then she would only do this by chance 1 in 6 times; our confidence in her genuine ability to detect when the milk was added has increased because the task is more difficult. If we now use 6 cups, there are 20 orders in which these can be arranged and the lady would only guess the correct order 1 time in 20 (or 5% of the time). Again, if she got the correct order we would now be very confident that she could genuinely tell the difference. Finally, if, as Fisher did, we use 8 cups there are 70 orders and the lady has only a 1 in 70 chance of getting the order correct by guessing (she will correctly guess only 1.4% of the time). The task has become incredibly difficult and so we’d certainly be very confident that she could genuinely tell the difference if she got the answer correct.2
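The numbers of possible orders quoted for the tea experiment can be checked with a short calculation. This is only a sketch: it assumes, as in the text, that exactly half of the cups had the milk added first and that the lady knows this, and the function name `chance_of_perfect_guess` is my own label. Under that assumption, the number of orders is the number of ways of choosing which half of the cups had the milk added first.

```python
from math import comb


def chance_of_perfect_guess(n_cups):
    """Return (number of possible orders, probability of guessing the right one).

    Assumes n_cups is even and that exactly half the cups had milk added first.
    """
    n_orders = comb(n_cups, n_cups // 2)  # ways to pick the 'milk first' cups
    return n_orders, 1 / n_orders


for n_cups in (2, 4, 6, 8):
    n_orders, probability = chance_of_perfect_guess(n_cups)
    print(f"{n_cups} cups: {n_orders} orders, chance of a correct guess = {probability:.1%}")
```

Each extra pair of cups shrinks the chance of a lucky guess, which is why a correct answer with 8 cups is so much more convincing than one with 2.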
This example illustrates an important point about science: we draw inferences based on confidence about a given set of results. As the probability of the result occurring by chance decreases, our confidence in the result being genuine increases. So, when we compare a group in which the supposed cause is present (experimental group) with one in which the supposed cause is absent (control group) the difference between these groups has to be sufficiently large that we have confidence that the difference is not a chance result. Just like with Fisher’s tea experiment, the more unlikely a chance result is, the greater our confidence in our results being genuine. Fisher actually suggested that an appropriate level of confidence is 95%, so that if the probability of our results being due to chance is less than 5% (or 1 in 20 – just as in the 6 cups example) then we should conclude that a result is genuine. This example lays the foundations of the statistical ideas presented in Chapter 2.
Fisher also noted a couple of other important points. The first was that he realized how important randomization is to experimentation. He correctly noted that his tea experiment depends upon the cups being ordered randomly, so the lady could have no way of predicting the order other than by taste. So, he recognized that randomization is an important tool in isolating causal factors. Second, he noted that although his single tea experiment was impressive in itself (if the lady correctly identified the order of the cups), it would be much more convincing if she could replicate the feat. Again, this is an important concept: our confidence in a given scientific statement will increase if a given set of results can be replicated many times (and by different researchers).
So, What is the Difference Between Experimental and Correlational Research?
We’ve covered many concepts including the isolation of cause and effect, and the basic framework for the experimental method. To answer our initial question now becomes easy. Experiments seek to isolate cause and effect by manipulating the proposed causal variable or variables. All that we have so far discussed in this section shows how and why we do this. In correlational research, we do not manipulate anything; we merely take a snapshot of several variables at a point in time. Based on what you’ve learnt in this chapter it should be clear that in taking such a snapshot causal variables are not isolated, confounding variables are not always controlled, and tertium quids are not always measured or eliminated. In short, correlational research does not allow causal statements to be made (at least if you adopt Mill’s and Popper’s ideas). That is not to say that experimental research is the only kind to have any merit. The obsession with control and manipulation of variables in experiments can result in some very artificial situations and alien environments, so the resulting behaviour we observe in people may not be representative of how they would respond in a more natural setting. As ever, the solution is probably a compromise: verify causal hypotheses in an experimental way and corroborate these findings with more naturalistic observations.
1.3 The Dynamic Nature of Scientific Method
* * *
Popper was really putting forward a framework by which theories compete over time and much of modern science works in this way. However, science also tends to work within a modus operandi that is dictated by common methods of testing and comparing theories. Kuhn (1970) wrote widely about the use of paradigms in science. A paradigm is really just a framework within which scientists work and it can operate at many different levels: it can be an experimental method commonly adopted to look at a problem or a theoretical or philosophical framework. Whatever you mean by a paradigm, the word implies agreement between scientists on several issues: (1) the problems (what should be studied?); (2) the methods (how should the problem be studied?); and (3) theoretical frameworks (on which framework can hypotheses be based?). At a basic level, the form of agreement can be metaphysical (e.g. the agreement that it is possible to explore past events and use them to predict future ones – determinism). At higher levels, the agreement can be about the theoretical constructs and their conceptualization (e.g. is it appropriate to theoretically characterize the mind as a machine?). There is also the obvious issue of methodological agreement (how is it best to study and/or measure a given construct?).
Kuhn (1970) believed that paradigms are dynamic; they change over time. He suggested three stages:
Pre-paradigmatic stage: This is a period of confusion in which different scientific schools (or communities) disagree about some issue (methodological, theoretical or otherwise) in science.
Normal science: Over time, some consensus is reached about agreeable methods/frameworks for study. A period of normal scientific development ensues through which the limitations of a paradigm may become apparent, or discoveries are made that compromise the theoretical or methodological framework within which these scientists operate. When these anomalies become too great, the paradigm must change.
Extraordinary science: The final stage of development is when the paradigm breaks down and gives rise to a new paradigm. This paradigm shift is characterized by a new theoretical and/or methodological approach.
One psychological example of Kuhn’s paradigm shift was the conversion from behaviourism to cognitivism in the 1970s (the so-called ‘cognitive revolution’). The behaviourist paradigm was popular for nearly 60 years between the early 1900s and the 1960s. At the turn of the last century (1900), there was much disagreement about the methods used to investigate psychology and a growing dissatisfaction with the non-empirical work of Freud and his counterparts. The behaviourist movement evolved out of a growing agreement that attempts should be made to measure and predict psychological phenomena. The theoretical belief (based on Pavlov’s, 1927, pioneering work on animals and Watson’s work with humans) was that behaviour was a series of learnt responses to external stimuli. In this sense, cognitive control over responses was largely ignored. This basic premise was used to explain a variety of psychological phenomena, most famously the acquisition of fears, which were thought to derive from experiencing some stimulus in the presence of some other traumatic event and forming an association between the two things (see Field & Davey, 2001; Davey & Field, in press for reviews). Indeed, Watson and Rayner’s (1920) classic study demonstrated how a small child could learn to fear a rat by associating it with the fear evoked by a piece of metal being hit with a claw hammer. This was an important shift away from studying unobservable phenomena onto looking at observable and measurable phenomena. Thinking back to what we’ve learnt about measurement, this shift was towards direct measures of psychological constructs (we can measure behaviours and this tells us all we need to know about motivations).
A course of ‘normal science’ followed and during the 1960s behaviourist methods were especially at the forefront of clinical practice (in which patients un-learnt their anxiety by having their fearful responses replaced with calm ones). However, over time the limitations of the behaviourist framework became apparent; for example, the behaviourist explanation of phobias failed to explain why some people who experience a stimulus with a trauma do not develop phobias (see Rachman, 1977). The result was a dramatic shift towards cognitive explanations and therapeutic interventions. This shift stepped back from observable and measurable phenomena in many senses because psychologists were now interested in motivations and cognitions that could not be directly measured (we can’t, for example, measure motivation to succeed, directly).
Kuhn’s ideas do not necessarily conflict with Popper’s: both agree on the need for criticism in scientific development. Really, Popper’s ideas work within the ‘normal science’ period of Kuhn’s model. However, it isn’t clear (to me) that dramatic paradigm shifts occur very often. All paradigms are limiting and so in many cases paradigms are developed and altered according to what happens in the course of ‘normal science’. It is perhaps only when a technological advance occurs, or if a paradigm is too limiting, that it results in a complete shift. For example, artificial intelligence became the vogue methodology with the development of computers in the 1980s and 1990s and the advent of functional brain scanning devices has recently led to a glut of psychologists running around like headless chickens trying to get their hands on these devices so they can scan anything that moves! However, although these methods have become the fashion, they have not replaced other methodologies. In some cases, paradigms even merge, for example, behaviourist and cognitive approaches now combine in explanations and treatments of psychological disorders (see Field & Davey, 2001; Davey & Field, in press).
1.4 Summary
* * *
This chapter has explored the reasons why scientists conduct experiments to answer research questions. We began by looking at why scientists measure things – to make research conclusions comparable across researchers. In addition, we started to think about the way in which we measure things and the different levels of measurement. This led us to a discussion of why experimentation is valuable for isolating cause and effect and to discuss some different philosophical ideas about causality. Ultimately we discovered that to isolate causal variables we need to look at situations in which the supposed cause is present (an experimental group) and compare it against a situation in which the cause is absent (a control condition). Finally, we used these ideas to look at how theories develop using Popper’s ideas about falsification and considered an example of when a control group might not actually act as a control. We also discovered that we could use probability to determine the ability of someone to detect when milk is added to tea – which was a bonus!