by Andy Field
3 Experimental Designs
* * *
Whether a study’s findings are useful or not depends crucially on its design. No matter how ingenious or important an idea for an experiment might be, if the study is badly designed, it’s worthless. It’s worth bearing in mind that, wonderful though statistics might be, no amount of clever statistical analysis can help if your study is poorly designed in the first place. In this chapter, we will consider some of the issues in designing an experiment, and the advantages and disadvantages of different types of experimental design.
3.1 The Three Aims of Research: Reliability, Validity and Importance
The aim is to devise an experiment that produces results which are valid (in the sense that they actually show what it is that you intend them to show: see page 44), reliable (potentially replicable by yourself or anyone else, so that they can be confirmed and reproduced: see page 47) and generalizable (the findings should have a wider application than merely to the participants on whom the study was originally performed, in the particular circumstances in which they were originally tested). Ideally a study should also have importance. This is rather subjective: research can possess all of the previous qualities and still be essentially trivial. However, the opposite cannot be true: research cannot possibly be important if its findings are unreliable or invalid. The apparent importance of research findings is a completely separate issue to how well the research was designed and carried out, and the two issues should not be confused. However, as Sidman (1960) pointed out, science has a habit of changing its mind over time as to what are and are not important findings. Consequently, one should aim to obtain data which are not so tied to the theory that prompted their collection as to be useless without that theory.
As mentioned in Chapter 1, the virtue of the experimental method for doing science is that it is an excellent procedure for determining cause and effect. A well-designed experiment isolates causal factors well; a poorly designed experiment leaves so much scope for alternative explanations of the results that were obtained, that the results are virtually useless.
What Makes a Study’s Design Good or Bad?
In most studies in psychology, we aim to get at least one score from each participant. (I say ‘most’, because of course there are studies in which this isn’t true – for example, where we get frequency data, merely counting how many participants fall into various categories. But for simplicity’s sake, let’s stick to discussing scores.) Any obtained score can be thought of as consisting of a number of different components:
A ‘true score’ for the thing we hope we are measuring;
A ‘score for other things’, that we are measuring inadvertently;
Systematic (non-random) bias, which isn’t too bad as long as it affects all participants in the study equally, rather than affecting some groups more than others;
Random (non-systematic) error, which should cancel out over large numbers of observations.
We want our obtained score to consist of as much ‘true score’, and as little of the other factors, as possible. If our obtained score contains a large dollop of a score for other things, it’s not a valid measure of what we think it is. If our obtained score has a high proportion of random error in it, it won’t be a reliable measure of what we want to measure. Whenever you read a study (or design one!) pause to think about what contribution each of these factors might be making to overall scores.
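To make the difference between systematic bias and random error concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made purely for illustration: the function names, the ‘true score’ of 50, the constant bias of 3 and the normally distributed random error are all hypothetical. It simply shows that averaging over more observations shrinks the contribution of random error, while a constant bias stays put.

```python
import random
import statistics


def simulate_scores(n, true_score=50.0, bias=3.0, noise_sd=10.0, seed=0):
    """Return n obtained scores: true score + constant bias + random error.

    All parameter values are hypothetical; the random error is assumed
    to be normally distributed with a mean of zero.
    """
    rng = random.Random(seed)
    return [true_score + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]


def mean_obtained(n, **kwargs):
    """Mean of n simulated obtained scores."""
    return statistics.fmean(simulate_scores(n, **kwargs))


# As n grows, the mean settles near 53 (true score plus bias), not 50:
# the random error averages out, but the systematic bias does not.
for n in (10, 100, 10_000):
    print(n, round(mean_obtained(n), 2))
```

Note that no amount of extra data tells you the bias is there: the mean converges very nicely, just on the wrong value.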
Evidence for the intellectual inferiority of women
That got your attention, didn’t it? Let me give a concrete example of how analysing a score into its components helps in evaluating its worth. In the 19th century, evolutionary theory sparked a great deal of interest in the development of human abilities, especially ‘intelligence’. The French neurologist Paul Broca investigated this issue by making very careful measurements of brain weight. He found that Caucasian men had larger brains than Caucasian women, who in turn had larger brains than negroes. Modern brains were supposedly heavier than mediaeval brains, and French brains were heavier than German brains! These weight differences were considered to reflect the differences in intelligence between these different groups. On this view, white middle-class men (and French middle-class men in particular) were at the pinnacle of evolution. One of the founders of social psychology, Le Bon (1879), claimed:
In the most intelligent races, as among the Parisians, there are a large number of women whose brains are closer in size to those of gorillas than to the most developed male brains. This inferiority is so obvious that no one can contest it for a moment: only its degree is worth discussion. All psychologists who have studied the intelligence of women . . . recognise today that they represent the most inferior forms of human evolution and that they are closer to children and savages than to an adult, civilised man.
Before any male readers use this as scientific justification for relegating their partners to housework duties, let’s consider the various factors that go to make up Broca’s ‘overall scores’ – his measurements of brain weight. What do these really consist of?
Firstly, is brain weight a ‘true score’ for intelligence? Without going into detail, it’s known that within a species, there is no relationship between brain weight and intelligence: some very bright people have possessed spectacularly small brains, and vice versa. So, on these grounds alone, brain weight is not a valid measure of intelligence. What, then, is brain weight a measure of? What ‘other things’ does it reflect? We don’t know enough about brain function to answer this as yet, but it seems pretty certain that brain weight isn’t a measure of anything that’s interesting from a psychological point of view.
What then gave rise to the differences that Broca found? He was very painstaking in his measurements, so random measurement error probably didn’t contribute much to his overall scores of brain weight. However, his measurements may have been affected considerably by systematic biases. Gould (1981), in a fascinating book on how ‘scientific’ measurements have been misused in the service of bigotry, describes a number of ways in which Broca’s male and female brains differed systematically. His female brains came mainly from elderly women, and the male brains from younger men who had died in accidents. This immediately biases measurements in favour of male brains being heavier, as brains shrink with age – quite apart from any senile degenerative changes. Brain size is also related to body size: men may have bigger brains than women, but they also have bigger bodies. Gould (1981) reanalysed Broca’s data, and found little or no difference in brain size between men and women, once body size was properly taken into account. So, ladies, you need not despair – put down that Hoover and put your little brain to good use by reading the rest of this book. (Note that it might still be the case that women are intellectually inferior – it’s just that Broca’s measurements don’t prove it! I advance this argument purely as a logical possibility, and would ask for all hate mail on this topic to be addressed to Andy).
In short, a consideration of the constituents of Broca’s ‘overall score’ of brain weight suggests that weight is a reliable measurement (in the sense that if I weighed Broca’s brains, I’d probably get very similar results), but not a valid measure of intelligence – both because it doesn’t actually measure intelligence and because it is open to systematic measurement biases which mean that it is measuring things other than intelligence.
Maximizing Your Measurements’ Reliability
One factor in achieving reliability is to make sure that the dependent variable is measured as precisely as possible (see page 48 on ‘measurement error’). An aid to precise measurement is precise, unambiguous and objective definition of whatever it is you are measuring. In some cases, definition is relatively clear-cut: for example, our definition of ‘memory’ might be ‘the number of words recalled in our experiment’. In other cases, definition can be more problematic. A simple definition might not be available. For example, we might be interested in the effects of frustration on children’s aggression: aggression is notoriously difficult to define. We could get round this by arriving at a definition by consensus: for example, we could film the children’s behaviour following our manipulations, and then get a group of independent judges to rate the activities for ‘aggression’. Those features of the children’s behaviour for which the judges showed high agreement would be used as measures of ‘aggression’. Another technique is to resort to an ‘operational definition’: for example, when studying play, we could say ‘for the purposes of this study, I define play as behaviour patterns X, Y and Z’. Whether or not other people agree on this definition is up to them, but at least you’ve made it clear exactly what it is you are measuring.
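One way of putting a number on ‘high agreement’ between judges is sketched below. It is an illustration only: the text does not prescribe a particular statistic, and Cohen’s kappa for two judges is swapped in here as one common choice (with a larger panel of judges you would need a different index). The function name, the rating labels and the ratings themselves are hypothetical.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges' categorical ratings.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of film clips on which the two judges gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, given how often each judge used each label.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single, identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of eight film clips by two independent judges.
judge_1 = ["aggressive", "aggressive", "not", "not",
           "aggressive", "not", "not", "aggressive"]
judge_2 = ["aggressive", "aggressive", "not", "not",
           "not", "not", "not", "aggressive"]
print(cohens_kappa(judge_1, judge_2))  # 0.75
```

Here the judges agree on seven clips out of eight (0.875), but half of that agreement would be expected by chance alone, which is why kappa comes out lower than the raw agreement rate.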
Maximizing Your Measurements’ Validity
There are different ways in which your study’s results can lack validity (see page 44 for a discussion of this). If your obtained measurements are not due to your manipulations, but are instead actually caused by other factors, then they lack ‘internal’ validity. The key to having high internal validity is to use a good experimental design (see below). If your findings are not representative of humanity, but are only valid for the specific situation within which you obtained them, then they lack ‘external’ validity (or ‘ecological validity’). Experimental effects can be very reliable (in the sense of reproducible) without necessarily having much to do with how people function in real life. External validity is trickier to deal with, and requires you to use your intuition and judgement to some extent.
Threats to internal validity
Most of the factors that may reduce internal validity can be avoided by sound experimental design. Here are some of the most common threats to internal validity.
Group threats: If our experimental and control groups were different to start with, we might merely be measuring these differences rather than measuring any differences that were solely attributable to what we did to the participants. Selection differences can produce these kinds of effects – for example, using volunteers for one group and non-volunteers in another, or comparing a group of undergraduates to a group of mental patients. ‘Group threats’ of this kind can largely be eliminated by ensuring participants are allocated to groups randomly. However, if you are looking at sex- or age-differences on some variable, group threats of some kind are largely unavoidable (more on this below).
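Random allocation is easy to do properly, and a minimal sketch of one way to do it follows. The function name, the group labels and the participant codes are hypothetical; the approach assumed here is simply to shuffle the participant list and then deal participants out to the groups in turn, which also keeps the group sizes as equal as possible.

```python
import random


def allocate_randomly(participants, groups=("experimental", "control"), seed=None):
    """Shuffle the participants, then deal them out to the groups in turn."""
    rng = random.Random(seed)  # pass a seed only if you need a reproducible record
    shuffled = list(participants)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    allocation = {group: [] for group in groups}
    for position, person in enumerate(shuffled):
        allocation[groups[position % len(groups)]].append(person)
    return allocation


# Twenty hypothetical participant codes, allocated to two groups of ten.
people = [f"P{number:02d}" for number in range(1, 21)]
allocation = allocate_randomly(people, seed=42)
for group, members in allocation.items():
    print(group, sorted(members))
```

The point of letting the random number generator decide is that neither you nor the participants get any say in who ends up where, so pre-existing differences should be spread across the groups by chance rather than by selection.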
Regression to the mean: If participants produce extreme scores on a pre-test (either very high or very low), by chance they are likely to score closer to the mean on a subsequent test – regardless of anything the experimenter does to them. This is called regression to the mean, and it is particularly a problem for any real-world study that investigates the effects of some policy or measure that has been introduced in response to a perceived problem. Suppose, for example, the police had a crackdown on speeding as a consequence of particularly high accident rates in 2001: if accident rates decreased in following years, it would be tempting to conclude that this was a consequence of the police’s actions. This might be true – but the decrease might equally well have been due to regression to the mean. Because accident rates were very high in 2001, they were more likely to go down in subsequent years than up – hey presto, you have an apparently effective traffic policy. The same kind of argument applies to interventions to help poor readers, depressives, ‘alternative’ medical treatments, etc.
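Regression to the mean can be demonstrated with a short simulation. The sketch below is hypothetical throughout: it assumes a set of road ‘sites’, each with a stable underlying accident rate plus independent, normally distributed year-to-year noise, and it involves no intervention of any kind. It picks out the sites with the worst figures in the first year and looks at those same sites in the second year.

```python
import random
import statistics


def regression_to_mean_demo(n_sites=2000, true_mean=100.0, true_sd=10.0,
                            noise_sd=10.0, top_fraction=0.1, seed=1):
    """Return (year-1 mean, year-2 mean) for the sites that were worst in year 1.

    Nothing is done to any site between the two years: each yearly figure
    is just the site's stable rate plus fresh random noise.
    """
    rng = random.Random(seed)
    true_rates = [rng.gauss(true_mean, true_sd) for _ in range(n_sites)]
    year1 = [rate + rng.gauss(0.0, noise_sd) for rate in true_rates]
    year2 = [rate + rng.gauss(0.0, noise_sd) for rate in true_rates]
    # Select the sites with the most extreme (highest) year-1 figures.
    n_selected = max(1, int(n_sites * top_fraction))
    worst = sorted(range(n_sites), key=lambda i: year1[i], reverse=True)[:n_selected]
    return (statistics.fmean(year1[i] for i in worst),
            statistics.fmean(year2[i] for i in worst))


first_year, second_year = regression_to_mean_demo()
print(round(first_year, 1), round(second_year, 1))
```

Under these assumptions the selected sites still look worse than average in the second year, but markedly less so than in the first, even though nobody cracked down on anything. Any policy introduced at those sites between the two years would have looked like a success.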
Time threats: With the passage of time, events may occur which produce changes in our participants’ behaviour; we have to be careful to design our study so that these changes are not mistakenly regarded as consequences of our experimental manipulations.
History: Events in the participants’ lives which are entirely unrelated to our manipulations of the independent variable may have fortuitously given rise to changes similar to those we were expecting. Suppose we were running an experiment on anxiety in New Zealand, a country known for its propensity to earthquakes. We test participants on Monday, to establish baseline anxiety levels, administer some anxiety-producing treatment on Wednesday, and test the participants’ anxiety levels on Friday. Unknown to us, there is an earthquake on Thursday. Anxiety levels are much higher on Friday due to the earthquake, but we mistakenly attribute this increase to our experimental manipulations on Wednesday. (Don’t worry, there are ways round this problem, coming shortly – and they don’t involve avoiding doing research in New Zealand . . .)
Maturation: Participants – especially young ones – may change simply as a consequence of development. These changes may be confused with changes due to manipulations of the independent variable the experimenter is interested in. For example, suppose we were interested in evaluating the effectiveness of a method of teaching children to read. If we measure their reading ability at age four, and then again at seven after they have been involved in the program, we can’t necessarily attribute any improvement in reading ability to the program: the children’s reading might have improved anyway, perhaps due to practice at reading in other contexts, etc. In this case, it’s pretty obvious that maturation needs to be taken into account. However, these kinds of effects can occur in more subtle ways as well. For example, in a pre-test/post-test design in adults, any observed change in the dependent variable might be due to a reaction to the pre-test. The pre-test might cause fatigue, provide practice, or even alert the participant to the purpose of the study. This may then affect their performance on the post-test.
Instrument change: Good physicists frequently calibrate their equipment. This guards against obtaining apparent changes in what they are measuring merely because their measuring device has changed. Imagine working in a nuclear power station and concluding that it was safe to go into the reactor core, unaware that the ‘negligible radiation’ reading on your Geiger counter was due to the fact that the batteries had run down. Similar (if somewhat less dramatic) effects are less obvious in psychology, but may happen nevertheless: for example, interviewers may become more practised, or more bored, with experience. An experimenter may get slicker at presenting the stimuli in an experiment. Factors such as these may change the measurements being taken, and these changes may be mistaken for changes in the participant rather than in the measuring tool.
Differential Mortality: This sounds a bit dramatic! If your research involves testing the same individuals repeatedly, participants may sometimes drop out of the study for various reasons. This can make the results of the study difficult to interpret. For example, if all of the unsuccessful cases on a drug treatment program drop out, leaving us only with the successful cases, then a pre-test on the whole group is not comparable to a post-test on what remains of the group. There might be systematic differences between the people who remain and those that dropped out, and these differences might be wholly unrelated to your experimental manipulations. In the current example, it might be that those who remained in the drug treatment program had higher levels of willpower than those who left.
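A small worked example shows how much drop-out can distort a pre-test/post-test comparison. The data below are invented for illustration (hypothetical scores on some outcome measure, with `None` marking a participant who dropped out before the post-test), and the function names are likewise hypothetical. The sketch compares the misleading calculation, which sets the post-test mean of those who remain against the pre-test mean of the whole group, with a like-for-like comparison restricted to those who completed both tests.

```python
import statistics

# Hypothetical (pre-test, post-test) scores; None means the participant dropped out.
participants = [
    {"pre": 20, "post": 21},
    {"pre": 18, "post": 19},
    {"pre": 22, "post": 22},
    {"pre": 8, "post": None},
    {"pre": 10, "post": None},
    {"pre": 12, "post": None},
]


def naive_change(records):
    """Post-test mean of the remaining participants minus pre-test mean of everyone."""
    pre_all = statistics.fmean(r["pre"] for r in records)
    post_remaining = statistics.fmean(r["post"] for r in records if r["post"] is not None)
    return post_remaining - pre_all


def completer_change(records):
    """Mean change for only those participants who took both tests."""
    completers = [r for r in records if r["post"] is not None]
    return statistics.fmean(r["post"] - r["pre"] for r in completers)


print(round(naive_change(participants), 2))      # 5.67
print(round(completer_change(participants), 2))  # 0.67
```

In this made-up case nearly all of the apparent improvement comes from the low scorers leaving, not from anyone getting better. The like-for-like figure is more honest about change, but bear in mind that it still describes only the kind of person who stays in the study.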
Reactivity and Experimenter Effects: Measuring a person’s behaviour may affect their behaviour, for a variety of reasons. People’s reaction to having their behaviour measured may cause them to change their behaviour. I was once asked to take part in a long-term study on the relationship between diet and health: completing the dietary questionnaire made me realise that my diet consisted almost solely of pizzas, and so I changed my behaviour (well, for a while, at least). Perhaps I’ll now live to a hundred as a consequence of my new healthy lifestyle, whilst the organizers of the study end up with the mistaken impression that living solely on pizzas leads to a long and healthy life. Merely measuring my behaviour caused it to change. There’s a huge social psychological literature on ‘experimenter effects’: the experimenter’s age, race, sex and other characteristics may affect the results they obtain (Rosenthal, 1966; Rosenthal and Rosnow, 1969). Experimenters can subtly and unconsciously bias the results they obtain, by virtue of the way in which they interact with their participants. Participants often respond to the ‘demand characteristics’ of an experiment (Orne, 1962, 1969) – that is, they try to behave in a way that they think will please (or, occasionally, annoy!) the experimenter, for example by attempting to make the experiment ‘work’ by giving the ‘right’ data.
Ideally, you could minimise these effects by using a ‘double-blind’ technique. This involves both the experimenter and the participant being unaware of the experimental hypothesis and which condition the participant is in. If the experimenter is as ignorant as the participant about what’s going on, there’s little opportunity for the experimenter to bias the results. Unfortunately, as a student, you are unlikely to have the resources to employ someone to run your experiment for you, and it has to be said that most psychologists don’t bother with double-blind techniques either.
Related to demand characteristics is the possibility that participants may show ‘evaluation apprehension’ (Rosenhan, 1969), a posh term for anxiety about being tested. Many non-psychologists seem to fail to appreciate that the experimenter is usually interested only in average performance, and isn’t at all interested in their data as individuals. I’ve run experiments in which participants have treated the experiment as a test of their abilities, and have been so concerned with not looking stupid in front of me that they have failed to supply me with decent data! Finally, questionnaires may give rise to ‘social desirability’ effects, with respondents telling porkies about their income or sexual practices to look good to the experimenter. (‘How many times have you had sex with a horse?’ is unlikely to elicit many accurate replies!)