How to Design and Report Experiments
3.6 Practical Tasks
* * *
Which of these are quasi-experimental designs, and why?
A comparison of two different methods of teaching statistics:
Dr. Plonker teaches statistics to Idontgoto University first-years, using an excellent statistics book (‘Discovering Statistics using SPSS for Windows’ by A. Field, available from all good bookshops). Dr. Skollob teaches an identical course to first-years at Nohope University, but without using this rather fine statistics book. Both groups of students are given the same statistics test at the end of the course, and it is concluded that Andy’s book makes a big difference to students’ statistics comprehension.
The effects of shift-patterns on employee performance:
Concerned at falling profits, a big insurance company recruits a psychologist to see if the performance of their telephone sales division could be improved. She compares the number of sales made by telephone sales operatives working on the company’s existing three different shifts (morning, midday and evening). A significant difference is found between the three groups on this measure, and it is concluded that shift-pattern affects operatives’ sales performance.
Gender differences in the effects of stress on performance:
A psychologist wants to know if memory is impaired by stress, and whether this interacts with the gender of the participant. He presents participants with a word list and measures how many words they can recall from it five minutes later. There are four groups of participants, two male groups and two female. One group of each gender is stressed during the five-minute interval, by watching a video of an eye operation. The four groups differ in the number of words recalled: overall, females remember more words than males, and the stress manipulation has no effect on recall.
Answers:
This is a quasi-experimental design because the participants are not allocated randomly to the two experimental conditions (statistics book versus no statistics book). Students at Idontgoto University might differ from those at Nohope University in all sorts of ways – it might be harder to get into Idontgoto in the first place, so that the students at the two universities differ in initial ability; or Nohope students might have a heavier workload on other courses so that they can’t devote as much time to studying statistics. I leave it to you to think of other possible confounding factors. The important point is that a true experimental design (with random allocation of students to the two groups) would eliminate all of these alternative explanations.
This is a quasi-experimental design. Workers were recruited on the basis of the shift-patterns that they were already working, rather than being allocated randomly to the three different shifts. People who opt to work at different times may differ in various characteristics that are not under the experimenter’s control. These might differ systematically between conditions and hence act as confounding variables. For example, there is a psychological dimension of ‘morningness/eveningness’: some people function better early in the day, whereas others perform better late at night. It’s unlikely that people would volunteer for shift-times that were at odds with their own diurnal cycle, so this at once introduces a possible systematic difference between the three conditions. Note that there is another problem with this particular study. If sales performance is found to be affected by shift-pattern, there are several possible explanations for this. It might be due to the effects of the shift-pattern on the operatives themselves (e.g., they might be more tired at one point in the day than another) or it might be that the number of potential sales that can be made varies during the day (more people might be available to be phoned during the evening than during the morning; if the number of successful sales is a constant proportion of the number of phone calls made, then the evening shift-workers are likely to produce more sales for this reason alone).
Most people would consider this to be a true experiment, but strictly speaking you could argue that it is quasi-experimental in the sense that the experimenter cannot allocate participants randomly to one gender or the other. While the psychologist does not have complete control over the ‘gender’ independent variable, he does, however, have complete control over the other independent variable, ‘stress level’. Within each gender, the psychologist can allocate participants randomly to the ‘stressed’ and ‘non-stressed’ conditions. The design is good enough to enable meaningful conclusions to be drawn about the effects on memory of gender and stress: however, as with any research on gender effects, because the experimenter can’t manipulate this variable directly, the conclusions about which aspects of gender affect performance are often fairly ambiguous (are the effects due to biological differences, socialisation differences or a mixture of the two?).
3.7 Further Reading
http://www.apa.org/ethics/code.html [A summary of the American Psychological Association’s Ethical Guidelines].
American Psychological Association (1992). Ethical principles of psychologists and code of conduct. American Psychologist 47, 1597–1611.
http://www.bps.org.uk/documents/Code.pdf [A downloadable Adobe Acrobat file containing the Code of Conduct of the British Psychological Society].
Campbell, D.T. and Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand-McNally.
Gould, S.J. (1981). The mismeasure of man. London: Penguin. [A fascinating study of the misuse of quantitative methods in the service of prejudice and bigotry].
Martin, P. and Bateson, P. (1993). Measuring behaviour: an introductory guide (2nd edition). Cambridge: Cambridge University Press. [Essential reading if you plan to use observational techniques].
Rosenthal, R. (1966). Experimenter effects in behavioral research. New York: Appleton-Century-Crofts. [Rosenthal’s original work on experimenter effects somewhat overstates the case, but is interesting nevertheless].
Rosenthal, R. and Rosnow, R.L. (Eds.) (1969). Artifact in behavioral research. New York: Academic Press. [A collection of interesting papers on the topic of artifacts and potential sources of bias in psychology research].
Rosenthal, R. and Rosnow, R.L. (1975). The volunteer subject. New York: Wiley. [A fascinating description of how volunteers and non-volunteers might differ, and the implications of this for the conclusions drawn from psychology experiments].
Sidman, M. (1960). Tactics of scientific research. New York: Basic Books. [A cogent justification for the use of single-subject methods, and a useful description of many different designs of this type].
Notes
1 ‘Student’ was the pseudonym of William S. Gosset, a statistician for Guinness Breweries. He’s best known for inventing the t-test. Not a lot of people know that. (To be read in a Michael Caine accent).
PART 2 ANALYSING AND INTERPRETING DATA
* * *
4 Descriptive Statistics
* * *
On page 25 we saw how probability could be used to give us confidence about a particular hypothesis (in that case detecting whether milk was added before or after tea). The following four chapters expand these ideas to look at the kinds of statistical procedures that have been developed to test research hypotheses. These chapters are intended as a basic grounding in what you need to know to select and interpret statistical tests (and a bit of basic theory in why we use statistics in the first place). There isn’t going to be any detailed coverage of the mathematical mechanics of the tests, or how to do them using computer packages such as SPSS. For that level of information I, not surprisingly, suggest you look at my statistics textbook (Field, 2000) or the teaching notes on my web pages.
4.1 Populations and Samples
* * *
As researchers we are usually interested in answering general questions. So, psychologists are looking for general rules about all people (such as, how do people remember things?), market researchers are interested in rules about all consumers (why do people buy certain products?), neuroscientists are looking for general rules about biological systems (how do all neurons communicate?), and physicists are interested in the behaviour of all sub-atomic particles. Whatever it is we want to make generalizations about, the best way to find general rules would be to gather data about every single instance of the things in which we’re interested. An entire collection of things is known as a population. Psychologists are interested in the population of people (i.e. everyone on the planet), market researchers could be interested in the population of consumers, and physicists the population of sub-atomic particles. Populations can be very general (i.e. all people), more specific (i.e. all people suffering from obsessive compulsive disorder) or extremely specific (i.e. all people suffering from obsessive compulsive disorder who recite the lyrics to Lucky by Radiohead backwards before entering a room).
As psychologists, this could mean collecting data from everybody on the planet. Unfortunately we have neither the time nor the resources to do this and so instead we collect data from a small subset of the population in which we’re interested. This subset is known as a sample from the population (Figure 4.1) and we can use the sample to make a guess about what results we would have found had we actually gathered data from the entire population. The size of the sample we take is very important because the bigger the sample the more representative it will be of the population as a whole. To draw a quick analogy, imagine you go to a party and your friend (we’ll call her Andi for argument’s sake) has a few beers and then decides to stick someone’s hat in the freezer (as a random example). How confident would you be that this was representative of the entire population? Would you be prepared to assume that everyone sticks hats in freezers after drinking alcohol? Well, probably not, you might just assume that Andi is a bit odd. What about if you went to a different party and this time you observed that five people all stuck people’s hats in the freezer? My guess is that you’d start to think that perhaps there was something going on and that maybe alcohol did have some effect on hat-freezing behaviour. How about if we go to a very big party (with lots of freezer space) and observe 10,000 people all engaging in hat freezing behaviour?1 Our confidence in this last instance should be very high because we’re seeing the same behaviour in a large number of people. This analogy demonstrates how large samples give us more confidence in our conclusions and this was first noted by Jacob Bernoulli as the ‘law of large numbers’. 
Obviously the behaviours we observe in different samples will differ slightly, but given that large samples are more representative of the population than small ones it follows that the behaviours observed in different large samples will be relatively similar but the behaviours observed in different small samples are likely to be quite variable.
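The law of large numbers is easy to see by simulation. The sketch below is my own illustration in Python (not something from the book, which uses SPSS): it assumes a hypothetical ‘true’ rate of hat-freezing behaviour in the population and shows that estimates from big samples vary far less than estimates from small ones.

```python
import random
import statistics

random.seed(42)  # make the simulation repeatable

def sample_proportion(n, p=0.3):
    """Proportion of 'hat-freezers' seen in a sample of size n,
    assuming (hypothetically) that the true population rate is p."""
    return sum(random.random() < p for _ in range(n)) / n

# Draw 200 samples of each size and see how much the estimates vary.
# The spread shrinks as the sample size grows: big samples give
# estimates that sit closer to the true population value.
for n in (5, 50, 5000):
    estimates = [sample_proportion(n) for _ in range(200)]
    print(f"n = {n:5d}  spread of sample estimates = "
          f"{statistics.stdev(estimates):.3f}")
```

Exactly as the party analogy suggests, the behaviour of a sample of 5,000 is a much more trustworthy guide to the population than the behaviour of a sample of 5.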
Figure 4.1 Populations and samples
4.2 Summarizing Data
Frequency Distributions
Once we’ve collected data from a sample of participants we need some way to summarize these data to make it easy for others (and ourselves) to see general patterns within the scores collected. We pretty much have two choices here: we can either calculate a summary statistic (a single value that tells us something about our scores) or we can draw a graph. One very useful thing to begin with is just to plot a graph of how many times each score occurs. This is known as a frequency distribution, or histogram. In clinical psychology there’s a phenomenon called amphetamine psychosis in which people who abuse amphetamines end up having psychotic episodes (such as hallucinations and delusions) even when not taking the drug. Imagine we took a sample of 40 amphetamine users (which we hope is representative of the entire population of amphetamine users) and counted how many hallucinations they had in a day.
Number of hallucinations: 10, 6, 7, 8, 9, 7, 10, 2, 6, 8, 3, 9, 8, 10, 1, 5, 8, 4, 2, 9, 10, 6, 7, 8, 9, 7, 10, 2, 6, 8, 3, 9, 8, 10, 1, 5, 8, 4, 2, 9.
The first thing we can do is to arrange these scores into descending order:
Number of hallucinations: 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 2, 2, 1, 1.
It now becomes really easy to count the number of times each score occurs (this is called the frequency). So, we can easily see that six people had 10 hallucinations and only two people had one hallucination. We could count the frequency for each score and then plot a graph with the number of hallucinations on the horizontal axis (also called the X-axis or abscissa), and the frequency on the vertical axis (also called the Y-axis or ordinate). Figure 4.2 shows such a graph and even with this relatively small amount of data two things become clear that were not obvious from the raw scores: (1) the majority of people experience six or more hallucinations (the bars on the right hand side are generally higher than those on the left hand side); and (2) the most frequent number of hallucinations was eight (this value has the tallest bar and we can see that eight of our 40 people experienced this number of hallucinations).
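The sorting-and-counting step can be sketched in a few lines of Python (my choice of language for illustration; the book itself works in SPSS), using the 40 scores listed above to build a crude text histogram:

```python
from collections import Counter

# The 40 hallucination counts from the text.
scores = [10, 6, 7, 8, 9, 7, 10, 2, 6, 8, 3, 9, 8, 10, 1, 5, 8, 4, 2, 9,
          10, 6, 7, 8, 9, 7, 10, 2, 6, 8, 3, 9, 8, 10, 1, 5, 8, 4, 2, 9]

# Counter does the tallying: how many times does each score occur?
freq = Counter(scores)

# Score on the horizontal axis, frequency as the height of each bar.
for score in sorted(freq):
    print(f"{score:2d} hallucinations: {'#' * freq[score]} ({freq[score]})")
```

Running this reproduces the two observations made in the text: the bars for six or more hallucinations are generally taller, and eight hallucinations (a frequency of eight people) is the tallest bar of all.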
Types of distributions
Frequency distributions come in many different shapes and sizes. It is quite important, therefore, to have some general descriptions for common types of distribution (see Figure 4.3). In an ideal world our data would be distributed symmetrically around the centre of all scores, so that if we drew a vertical line through the centre of the distribution it would look the same on both sides. This is known as a normal distribution and is characterized by the bell-shaped curve with which you’ll soon become familiar. This shape implies that the majority of scores lie around the centre of the distribution (the largest bars on the histogram are all around the central value) and that the bars get smaller as we move away from the centre: scores become less frequent the more they deviate from the centre, and at the extremes they are very infrequent indeed (the bars are very short).

There are two main deviations from this type of distribution, and both are called skewed distributions. Skewed distributions are not symmetrical: instead, the most frequent scores (the tall bars on the graph) are clustered at one end of the scale, with the frequency of scores tailing off towards the other end. A skewed distribution can be either positively skewed (the frequent scores are clustered at the lower end and the tail points towards the higher, more positive scores) or negatively skewed (the frequent scores are clustered at the higher end and the tail points towards the lower, more negative scores).

Distributions also vary in their pointy-ness, or kurtosis (Figure 4.4). Kurtosis, despite sounding like a disease, refers to the degree to which scores cluster in the tails of the distribution, and it usually shows up in how flat or pointy a distribution looks.
A leptokurtic distribution is one that has many scores in the tails (a so-called heavy-tailed distribution) and so looks quite pointy. In contrast, platykurtic distributions are relatively thin in the tails and so are quite flat.
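Skewness and kurtosis can both be computed from the moments of the scores about their mean. The Python sketch below is my own illustration (the data sets in it are made up for the purpose, not from the book) of how the signs of these statistics match the shapes described above:

```python
def moment(xs, k):
    """k-th moment of the scores about their mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    # Positive when the tail points towards high scores,
    # negative when the tail points towards low scores.
    return moment(xs, 3) / moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    # Positive for leptokurtic (heavy-tailed, pointy) distributions,
    # negative for platykurtic (light-tailed, flat) ones.
    # Subtracting 3 makes a normal distribution score zero.
    return moment(xs, 4) / moment(xs, 2) ** 2 - 3

positively_skewed = [1, 1, 1, 2, 2, 3, 9]   # cluster low, tail points high
leptokurtic = [0] * 8 + [-5, 5]             # peaked centre, heavy tails
platykurtic = [1, 2, 3, 4, 5]               # flat, thin tails

print(skewness(positively_skewed) > 0)      # True
print(excess_kurtosis(leptokurtic) > 0)     # True
print(excess_kurtosis(platykurtic) < 0)     # True
```

Note the convention of subtracting 3 (‘excess’ kurtosis), which is what most statistics packages report, so that the bell-shaped normal distribution is the zero point of the scale.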
Figure 4.2 Frequency distribution of the number of hallucinations experienced by amphetamine users
Figure 4.3 Shapes of distributions
Figure 4.4 Pointy and flat distributions
The Mode
I mentioned in the previous section that when we summarize sample data we could do it either with a graph or with a summary statistic. We’ve already looked at a simple way of graphing data to get an idea of the shape of the set of scores (the distribution), and these shapes all seemed to refer back to the centre of the distribution (we talked about scores being close or far from the centre of the distribution). If we’re looking for a single value that sums up our data, then this central point seems like a good place to start. However, there are several ways in which we could find the centre of a distribution. If we’re using a sample of people to tell us something general about behaviour then it makes sense that one thing we might be interested in is a typical score. This score would tell us roughly how most people behaved. If we plot our data points in a histogram (see previous section) we can get a rough idea of what the most common scores are (they have the tallest bars) and the single most common score is often obvious (in Figure 4.2 it is eight hallucinations). This most common score is called the mode. To calculate the mode we simply place the data in ascending order (to make life easier) and then count how many times each score occurs. The score that occurs the most is the mode. The mode has a couple of advantages: (1) it’s easy to calculate and simple to understand; and (2) we can use the mode with nominal data (see page 6) because we are simply counting the number of times an instance occurs. Before continuing, have a think about what disadvantages the mode has.
There are two main disadvantages of the mode. First, it is possible that a data set has two or more most frequent scores. If a data set has two modes it is known as bimodal, and if it has several modes it is called multimodal. This makes it a messy way to summarize data (you end up saying things like ‘the typical amount of time that people could withstand sitting in a bath of ice was 183 seconds or 3 seconds’). The second problem is that the mode can be changed dramatically if only a single case of data is added – this makes it an unrepresentative measure. Box 4.1 shows these two problems.
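Both the calculation and the bimodality problem are easy to sketch in Python (again my own illustration, with made-up numbers for the ice-bath example rather than the book’s):

```python
from collections import Counter

def modes(xs):
    """All of the most frequent scores: one value if the data are
    unimodal, several values if they are bi- or multimodal."""
    freq = Counter(xs)
    top = max(freq.values())
    return sorted(score for score, count in freq.items() if count == top)

# The first 20 hallucination scores from the text: unimodal.
scores = [10, 6, 7, 8, 9, 7, 10, 2, 6, 8, 3, 9, 8, 10, 1, 5, 8, 4, 2, 9]
print(modes(scores))                          # [8]

# A hypothetical bimodal case like the ice-bath example:
ice_bath = [3, 3, 3, 183, 183, 183, 50]
print(modes(ice_bath))                        # [3, 183]
```

Returning every most-frequent score, rather than arbitrarily picking one, makes the first disadvantage explicit: a bimodal data set simply does not have a single ‘typical’ value.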
The Median
Put simply, the median is the middle score of a distribution of scores when they are ranked in order of magnitude. Obviously you’ll get a middle score only when there’s an odd number of scores, and so if there’s an even number of scores we simply average the two middle scores. Imagine we asked 7 of the people that had amphetamine psychosis to stop taking amphetamines for 6 months and then recorded how many hallucinations they had in one day.
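That two-case rule (middle score for an odd number of scores, average of the two middle scores for an even number) can be sketched directly in Python. This is my own illustration with hypothetical data; the seven scores the book goes on to use are not reproduced here.

```python
def median(xs):
    """Middle score when the data are ranked in order of magnitude;
    with an even number of scores, average the two middle ones."""
    ranked = sorted(xs)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:          # odd number of scores: a true middle exists
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2  # even: average the pair

# Hypothetical daily hallucination counts for seven people:
print(median([4, 1, 0, 2, 7, 0, 3]))   # odd n  -> middle score, 2
print(median([4, 1, 0, 2, 7, 0]))      # even n -> (1 + 2) / 2 = 1.5
```

Note that sorting first is essential: the median is defined on the ranked scores, not on the order in which they were collected.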