by Andy Field
Reactivity is especially a problem when obtrusive measures which are under the participant’s control (e.g. verbal reports) are used. Participants may show practice or fatigue effects, or become increasingly aware of what the experiment is about. However, even something as apparently ‘low-level’ as reaction times can be affected by these kinds of effects. Some studies have shown that the elderly have slower reaction times than young undergraduates. Some of this difference may be due to age-related cognitive decline, but it may also occur as a result of different age-groups adopting different strategies within the experimental situation. There is some evidence that the elderly are more cautious in novel situations and are more concerned to help the experimenter by making fewer errors. Both of these factors would conspire to increase reaction times in a way which could be mistaken for age-related physiological deterioration, when it may actually reflect greater caution and an increased desire to please the experimenter.
So, there are lots of extraneous factors that can lead to changes in behaviour, changes that can be confused with the effects of our intended manipulations. Good experimental designs guard against (‘control’ for) all of these competing explanations for the changes in our dependent variable, and thus enable us to be reasonably confident that those changes have occurred because of what we did to the participants – that is, they are a direct consequence of our experimental manipulations.
Threats to external validity
Over-use of special participant groups: McNemar (1946) pointed out that psychology was largely the study of undergraduate behaviour. Rosenthal and Rosnow (1975) found that, 30 years later, 70–90% of participants were still undergraduates. Research suggests that students have higher self-esteem, take drugs and alcohol less (hah!), and are less likely to be married than are other young people. Young people in general are lonelier, more bored and more unhappy than older people (try telling that to your granny). Using volunteers may also cause problems: Rosenthal and Rosnow (1975) found that the participants recruited as volunteers via adverts were more intelligent, better educated, had higher social status and were more sociable than non-volunteers. On the downside, volunteers for research into psychopathology, drugs and hypnosis are more likely to have mental health problems. Volunteers generally have a higher opinion of, and respect for, science and scientists. That’s nice, but a bit of a nuisance if it makes them respond differently than non-volunteers. As mentioned in the section on ‘generality’, the extent to which this is a problem depends on the kind of research being done: it’s not automatically the case that it’s invalid to use volunteer student participants. A student’s visual system may be pretty much like that of any other human, even if their social behaviour is a bit strange!
Restricted numbers of participants: This is more a threat to reliability, but it also affects one’s ability to generalize to the population as a whole. Cohen (1988) has pointed out that most psychology experiments use too few participants for them to have a reasonable chance of attaining statistical significance. (See page 154 on ‘power’ so that you don’t make the same mistake.)
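To get a feel for why small samples are a problem, here is a rough sketch of how the chance of reaching significance grows with the number of participants. It is not the book's own procedure: the function name `approx_power` is made up for illustration, it assumes a two-tailed comparison of two independent groups, and it uses a normal approximation rather than an exact t-test calculation.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-tailed, two-group comparison of means.

    effect_size is the difference between the group means in
    standard-deviation units. The normal approximation used here is
    slightly optimistic when the groups are small.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                # critical value for alpha
    shift = effect_size * (n_per_group / 2) ** 0.5   # expected test statistic
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Assumed 'medium' effect of half a standard deviation.
for n in (10, 20, 64):
    print(n, round(approx_power(0.5, n), 2))
```

Under these assumptions, a study with 20 participants per group has only about a one-in-three chance of detecting a medium-sized effect, whereas 64 per group gives roughly an 80% chance.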
Maximizing Your Measurement’s Generality
Closely related to external validity is the issue of whether our findings will generalize to other groups of participants in other times and places. This is usually taken for granted by psychologists. The best measure of generality is by empirical testing – replications of the experiment by other people in other circumstances. If food additives make children hyperactive in Chippenham, then they should also do so in downtown Kuala Lumpur. If they don’t, that might be interesting in itself, but it would mean that we can’t make sweeping statements about the effects of additives on humanity. Generality can be enhanced at the outset by representative sampling – by making sure that you have indeed used participants who are typical of the population that you want to make statements about. Sampling methods include random sampling and stratified sampling (where the sample is deliberately constructed to mirror the characteristics of the parent population: for example, if the population consists of 90% poor people and 10% rich people, so too does your sample). Threats to generality may come from using volunteers and undergraduate students. Generalization needs to be confirmed not only across participants, but also across experimental designs, methods, apparatus, situations, etc.
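The idea behind stratified sampling can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a full sampling procedure: the population of 900 ‘poor’ and 100 ‘rich’ people is invented, and `stratified_sample` is a hypothetical helper name.

```python
import random


def stratified_sample(population, stratum_of, sample_size, seed=0):
    """Draw a sample whose strata proportions mirror the population's."""
    rng = random.Random(seed)
    # Group the population members by stratum.
    strata = {}
    for person in population:
        strata.setdefault(stratum_of(person), []).append(person)
    sample = []
    for members in strata.values():
        # Proportional quota, rounded to a whole number of people.
        # (With awkward proportions the quotas may not sum exactly
        # to sample_size; this sketch does not correct for that.)
        quota = round(sample_size * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Invented population: 90% poor, 10% rich.
population = [("poor", i) for i in range(900)] + [("rich", i) for i in range(100)]
sample = stratified_sample(population, lambda p: p[0], sample_size=50)
counts = {label: sum(1 for p in sample if p[0] == label) for label in ("poor", "rich")}
print(counts)
```

Within each stratum the individuals are still chosen at random; only the proportions are fixed in advance. A simple random sample of 50 would get the 90:10 split right on average, but any one sample could easily miss it.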
The generality of findings will depend to a large extent on the kind of research that’s being done. All other things being equal, the results from a study on basic cognitive processing are more likely to be generalizable than the findings from a study on the social interactions of city office workers, because there is probably greater scope for social and cultural influences to affect the results of the latter.
3.2 Different Methods for Doing Research
In the following sections, we will discuss three basic types of research that are commonly used in psychology: observational methods, quasi-experimental designs and ‘true’ experimental designs. The essence of observational methods is that they don’t involve direct manipulation of any variables by the researcher: behaviour is merely recorded, although systematically and objectively so that the observations are potentially replicable. Experimental methods, in contrast, do involve manipulation by the experimenter of one or more independent variables, together with some objective measurement of the effects of doing this. ‘Quasi-experimental’ methods are used in situations in which a well-designed experiment cannot be carried out for one reason or another (more on this in a moment).
Observational and quasi-experimental methods don’t allow us to unequivocally establish cause and effect in the same way that true experimental designs do, but they are still very useful. Also, a discussion of their limitations helps to demonstrate the strengths of the experimental approach.
Observational Methods
One way to find out about a phenomenon is simply to look at it in a systematic and scientifically rigorous way. Personally, I think that psychologists have often jumped in at the deep end, running experiments to find out about some phenomenon before they have collected sufficient observational data on it. Historically, most sciences (biology, physics and chemistry) were preceded by a phase of naturalistic observation, during which phenomena were simply watched in a systematic and rigorous way. Experimentation came later. Psychology has tended to skip the observational phase and tried to go straight into doing experiments. This isn’t always a good thing, since experimentation without prior careful observation can sometimes lead to a distorted or incomplete picture being developed.
Lessons could be learnt from the way in which biology supplanted comparative psychology as a means of finding out about animal behaviour. In the first half of the 20th century, behaviourism was the dominant method of studying animal behaviour. Innumerable experiments were performed to investigate learning, using a few species (mainly rats and pigeons) and highly artificial tasks (such as lever pressing) performed in very unnatural environments (such as mazes and Skinner Boxes). A lot of useful information has been obtained by using these experimental methods. However, after about 1960, most of the really interesting stuff on how animals behave has come from ethology and sociobiology, movements which have their roots in biology, not psychology.
One factor in the demise of comparative psychology as a discipline was that its attempts to produce universal laws of learning, that applied to all species including humans, were shown to be flawed. The observational data on a wide range of species that were produced by ethologists drew attention to this by demonstrating that many species show species-specific behaviour patterns as a result of natural selection having adapted their learning abilities to their particular environmental niche. While the study of learning was confined to experiments on a few species under highly constrained conditions, these problems were not apparent and could be overlooked. The ethologists and their observational data showed that the experimental methods used by the psychologists produced data that were highly reliable (in the sense of reproducible) but not necessarily valid (in the sense of providing an accurate picture of what animals can learn in their natural environments).
The strength of observational methods is that they enable one to get a good idea of how people (or animals) normally behave. This may be quite different from how they behave in an experiment. A good example of this is research on driving behaviour. A great deal of money has been spent on producing highly realistic driving simulators. While simulators are useful for studying some aspects of driving, they will never tell us much about how people normally drive for two reasons. First, participants know they are being studied and their behaviour is being recorded, and so they are likely to be on their best behaviour: they are unlikely to perform the various risky actions that they might perform in real life when they think they are not being watched, such as fiddling with the radio, shouting at the kids, picking their nose, driving with their knees, reading a map and so on. Secondly, no matter how ‘realistic’ the simulator, participants always remain aware of the fact that they are in a simulator, and that their actions will not have any real-world consequences in terms of death or injury. What this means is that for many aspects of driving behaviour, if one wants to get an accurate picture of how people really behave, unobtrusive observational methods may be preferable to experiments.
The downside to observational methods is that they are generally much more time-consuming to perform than experiments. Also, because of their non-intrusive nature, they don’t allow the identification of cause and effect in the same way that a well-designed experiment does. However, systematic observations may often provide hypotheses about cause and effect that can then be tested more directly with experimental methods. A full description of observational techniques – and the statistics that you should use to analyse the data obtained – is outside the scope of this book. For further information, have a look at Martin and Bateson’s (1993) book on the topic.
Quasi-Experimental Designs
In Chapter 1, Andy discussed how the experimental method is ideal for determining cause and effect. Sometimes, especially in real-world situations, it isn’t possible to conduct a true experiment, and one has to resort to a ‘quasi-experimental’ design. (Don’t get this confused with research conducted by an experimenter with a hump and a penchant for bell-ringing – that would be a Quasimodo experiment).
In a quasi-experimental study, the experimenter does not have complete control over manipulation of the independent variable. He or she has control over the timing of the measurement of the dependent variable, but no control over either the timing of the experimental manipulation or over how participants are assigned to the different conditions of the study. For example, it may be impossible to allocate participants randomly to different levels of the independent variable, for ethical or practical reasons. As a consequence, it is not possible to isolate cause and effect as conclusively as with a ‘true’ experimental design, in which participants are randomly assigned to different groups which receive different treatments. In a true experiment, the experimenter has complete control over the independent variable: he or she has control over the timing of the measurements of the dependent variable and control over the timing of the experimental manipulations. (The latter in itself is an aid to establishing cause and effect relationships, since we can avoid our intervention coinciding with events in the participants’ lives which might produce similar effects).
An example of a quasi-experimental study and its limitations
Suppose we were interested in whether daytime headlight use made motorcyclists more detectable to other road users. The ideal way of testing this would be to take a very large group of motorcyclists, and randomly allocate them to one of two conditions. One group would be told to use their headlight during the daytime, and the other group would be told not to. At the end of some period, say five years, we could see how many accidents had been experienced by each group, as a consequence of another road user failing to see them. This would be a true experiment, because we would have complete control over the independent variable (headlight use/non-use). Random allocation of participants to the two groups would enable us to ensure that no other variables could systematically affect our results.
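Random allocation itself is mechanically very simple, as the following sketch shows. Everything in it is invented for illustration: the 2,000 riders, their ‘caution’ scores and the variable names are assumptions, not data from any real study.

```python
import random
from statistics import mean

rng = random.Random(42)

# Invented riders, each with a pre-existing 'caution' score that the
# experimenter may not even know about.
riders = [{"id": i, "caution": rng.gauss(50, 10)} for i in range(2000)]

# Random allocation: shuffle the list, then split it down the middle.
allocation = riders[:]
rng.shuffle(allocation)
half = len(allocation) // 2
headlight_on = allocation[:half]    # told to ride with the headlight on
headlight_off = allocation[half:]   # told to ride with the headlight off

# Compare the two groups on the pre-existing characteristic.
caution_gap = (mean(r["caution"] for r in headlight_on)
               - mean(r["caution"] for r in headlight_off))
print(len(headlight_on), len(headlight_off), round(caution_gap, 2))
```

Because chance alone decides who goes where, the two groups should end up with very similar average caution, and the same goes for any other characteristic, measured or not. The larger the groups, the closer the match is likely to be.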
In practice, there are obvious ethical reasons why this experiment cannot be done. If headlight use did make a difference to motorcyclists’ detectability, then we would be taking risks with people’s lives. Instead, we would have to make do with a quasi-experimental design: we could take a group of motorcyclists who already prefer to ride with their headlight on, and compare them to a group of motorcyclists who prefer to ride with their headlight off. We have two groups of motorcyclists, as before, but there is a crucial difference: instead of the experimenter deciding which condition they perform, the participants have effectively allocated themselves to the groups. In other words, we would make use of a pre-existing difference between the participants, and use this as a basis for putting them in the different conditions of our study.
This might seem like a small difference, but it would have big implications for the conclusions that we could draw from the study. In the case of the truly experimental version, if we have allocated participants to conditions randomly, we know that the only difference between the participants in one condition and the participants in the other condition is that one lot used headlights and the other lot didn’t. If there is any difference in accident rates between the two groups, we can be reasonably confident that it is due to our experimental manipulation of headlight use.
In the quasi-experimental version, our observed results might be due to the difference between the two groups in terms of headlight use, but they might equally well be due to innumerable other factors which also differed between the two groups. For example, it might be that motorcyclists who use headlights are more safety-conscious than those who don’t. If so, the difference in accident rates might have nothing to do with headlight use, but instead occurred because motorcyclists in the ‘headlight-use’ group rode more cautiously than those in the ‘no-headlight’ group. Another possibility is that motorcyclists in the ‘no-headlight’ group might be riding older machines, whose generators don’t have the capacity to cope with running lights all day long. Older machines also have poorer brakes. Therefore any difference in accident rates between headlight-users and non-users might be due to differences in braking ability, and have little to do with the effects of headlights on visibility. All of these alternative possibilities (‘group threats’, in the terminology used above) can of course be explored and possibly eliminated as alternative explanations for the observed results; however, the beauty of the experimental method is that it eliminates them automatically, rather than requiring further research to be performed to do so.
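The safety-consciousness problem can be made concrete with a small simulation. It is a sketch built entirely on assumptions: accident risk is made to depend only on a rider's caution and not at all on headlight use, and cautious riders are made more likely to choose to use their headlight. All the numbers are invented.

```python
import random
from statistics import mean

rng = random.Random(1)


def accident_rate(group):
    return mean(r["accident"] for r in group)


riders = []
for _ in range(20000):
    caution = rng.random()  # 0 = reckless, 1 = very careful
    # Assumption: risk depends on caution only, NOT on headlight use.
    accident = int(rng.random() < 0.30 - 0.20 * caution)
    # Self-selection: cautious riders tend to prefer riding with lights on.
    chooses_headlight = rng.random() < caution
    riders.append({"accident": accident, "chooses_headlight": chooses_headlight})

# Quasi-experiment: compare the groups the riders sorted themselves into.
lights_on = [r for r in riders if r["chooses_headlight"]]
lights_off = [r for r in riders if not r["chooses_headlight"]]
quasi_gap = accident_rate(lights_off) - accident_rate(lights_on)

# True experiment: the experimenter allocates riders at random.
shuffled = riders[:]
rng.shuffle(shuffled)
half = len(shuffled) // 2
random_gap = accident_rate(shuffled[half:]) - accident_rate(shuffled[:half])

print(round(quasi_gap, 3), round(random_gap, 3))
```

In this made-up world headlights do nothing, yet the self-selected ‘no-headlight’ group still has a clearly higher accident rate, purely because of who ends up in it. With random allocation the gap shrinks to around zero.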
The difference between quasi-experimental and true experimental methods can be quite subtle. In psychology, we often use independent variables that are not wholly under our control, and hence strictly speaking, we are using quasi-experimental methods. Age and gender are the examples which spring to mind. Suppose you wanted to know if there were age-differences in problem-solving ability. You could take two groups of participants, one young and the other old, and give them some test of problem-solving ability. If there were any differences in performance, you might want to conclude that there were age-differences in problem-solving ability. However, this would be misleading. You may have demonstrated a difference in ability between the two groups, but because participants were not truly randomly allocated to one age-group or the other, but instead came ‘ready-aged’, you are unable to eliminate all of the other possible reasons for why the two groups of participants differed. The young and elderly groups differ in lots of ways other than chronological age: they have been born at different times, and so have had different life experiences which may have affected the way in which they behave in experiments. These ‘cohort effects’ complicate the interpretation of age- and gender-differences in psychology, and they arise because in the case of variables such as these, it is impossible for the experimenter to have complete control over the independent variables in the study. We’re being a bit pedantic here, and most researchers investigating the effects of age and gender would consider that they are performing ‘true’ experiments; however, the important point is to be aware that there are almost always complications in interpreting age and gender differences in psychology, and that they arise because these variables are not wholly under the experimenter’s control.
Types of quasi-experimental design
In the one group post-test design (Figure 3.1) we apply some treatment and then measure the participants’ behaviour afterwards. This is a seriously flawed design. The change in the participants’ behaviour may or may not be due to what we did. This design is prone to time effects, and we do not have any baseline against which to measure the strength of our effect. There’s really not much that you can usefully conclude from the results of a study like this.
The one group pre-test/post-test design (Figure 3.2) is the same as the previous design, except that we measure the participants’ performance before we apply our treatment, and then again afterwards. By comparing the pre- and post-intervention scores, we can assess the magnitude of the treatment’s effects. However, it is still subject to time effects, and we still have no way of knowing whether the participants’ performance would have changed without our intervention.
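A short simulation shows how a time effect can masquerade as a treatment effect in this design. It is only a sketch under invented assumptions: the treatment is set to have no effect whatsoever, while simply taking the test a second time is assumed to raise scores by five points. The untreated group is included purely to reveal what the one group design cannot see.

```python
import random
from statistics import mean

rng = random.Random(7)

PRACTICE_GAIN = 5.0      # assumed improvement from simply repeating the test
TREATMENT_EFFECT = 0.0   # assumed: the treatment does nothing at all


def post_minus_pre(treated):
    """Simulate one participant and return their pre-to-post change."""
    pre = rng.gauss(100, 15)
    post = pre + PRACTICE_GAIN + rng.gauss(0, 5)
    if treated:
        post += TREATMENT_EFFECT
    return post - pre


# What the one group pre-test/post-test design observes.
treated_gain = mean(post_minus_pre(True) for _ in range(500))

# What it never observes: the change that happens without any treatment.
untreated_gain = mean(post_minus_pre(False) for _ in range(500))

print(round(treated_gain, 1), round(untreated_gain, 1))
```

The treated participants improve by about five points, which looks like a respectable treatment effect, but people who received no treatment improve by about the same amount. Without that second figure there is no way to tell the two explanations apart.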