How to Design and Report Experiments


by Andy Field

Minor changes in the design of your study (and in particular, in your choice of dependent variable) can have a major effect on the conclusions that you will be able to draw from its results.

By answering the following five questions, you should be able to decide which statistical test is the most appropriate one to use. Question 1: what kind of data will I collect? Question 2: how many independent variables will I use? Question 3: what kind of design will I use (experimental or correlational)? Question 4: will my study use a repeated-measures or independent-measures design? Question 5: will my data be parametric or non-parametric?
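The five questions can be read as a small decision procedure. The sketch below is only an illustration pieced together from the worked answers in this section; it is not a transcription of the flowchart in Box 8.2. The function name `choose_test` and its parameter names are hypothetical. The Kruskal-Wallis test and the repeated-measures t-test are not named in this section and are filled in here as the standard counterparts of the tests that are.

```python
def choose_test(data, n_variables=1, design="experimental",
                measures="independent", parametric=True, levels=2):
    """Suggest a test from answers to the five questions (illustrative sketch)."""
    # Question 1: frequency (count) data leave us with Chi-Squared.
    if data == "frequency":
        if n_variables == 1:
            return "Chi-Squared Goodness of Fit"
        return "Chi-Squared Test of Association"

    # Question 3: a correlational design calls for a correlation coefficient.
    if design == "correlational":
        return "Pearson's r" if parametric else "Spearman's rho"

    # Question 2: this sketch only handles one independent variable for scores.
    if n_variables != 1:
        raise ValueError("this sketch covers one independent variable only")

    # Questions 4 and 5: repeated or independent measures, parametric or not.
    # Key: (exactly two levels?, type of measures, parametric?)
    tests = {
        (True, "independent", True): "independent-measures t-test",
        (True, "independent", False): "Mann-Whitney test",
        (True, "repeated", True): "repeated-measures t-test",
        (True, "repeated", False): "Wilcoxon test",
        (False, "independent", True): "one-way independent-measures ANOVA",
        (False, "independent", False): "Kruskal-Wallis test",
        (False, "repeated", True): "repeated-measures ANOVA",
        (False, "repeated", False): "Friedman's test",
    }
    return tests[(levels == 2, measures, parametric)]


# Ordinal ratings from the same people under two conditions:
print(choose_test("scores", measures="repeated", parametric=False))
```

A lookup table is used rather than nested conditionals so that each combination of answers can be checked against the text at a glance.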

8.6 Practical Tasks

* * *

Using the five questions in the text, plus the flowchart in Box 8.2, work out which statistical test you would use for each of the following studies:

Is there a relationship between political affiliation and attitudes to the Millennium Dome? A psychologist asked 100 supporters of each of the main political parties whether they approved or disapproved of the project.

Does exposure to mis-spelt words affect spelling ability? A psychologist takes two groups of students. One group is exposed to a list of mis-spelt words, while the other group is exposed to the same words, but correctly spelt. All participants’ ability to spell these words was then tested a week later. The measure was the number of words correctly spelt out of 40.

Do rats make better pets than hamsters? Twenty individuals who had owned both as pets were asked to provide ratings of hamsters and rats for suitability as pets.

Is there a relationship between belief in flying saucers and watching the ‘X-Files’? A group of 200 people were interviewed about their belief in flying saucers. Seventy said they believed in flying saucers, while the rest said they didn’t. Each was also asked whether they were a regular viewer of the ‘X-Files’.

Does shift-work affect cognitive performance? The performance of two groups of workers on a mental arithmetic test was compared: one group did permanent night shifts, and the other group did permanent day shifts.

Do working mothers differ from non-working mothers in terms of the quality of their interactions with their children? Fifty of each type of mother were video-filmed playing with their child for 30 minutes: one measure taken was the number of positive actions initiated by the mother rather than by the child. The variances were rather different between the two groups.

Are men’s pain thresholds affected by what they see on TV? Fifty men’s pain thresholds are measured three times – once after having watched a video of Clint Eastwood, and again after watching videos of Noddy and Mr. Bean.

Does drinking ‘Diet Coke’ affect window-cleaners’ attractiveness? Each of several women in an office-block is asked to assess the attractiveness of their window-cleaner on two separate occasions: on one occasion, he drinks ‘Diet Coke’, while on the other he does not.

Is there a relationship between the amount of personal possessions one has, and happiness? Three hundred people are interviewed, and asked to provide estimates of how many possessions they have (where ‘possession’ is defined as a single article costing over £100) and how happy they are (on a seven-point scale).

Do different types of road-sign affect speeding behaviour differently? A psychologist set up four different road signs: ‘Slow’, ‘30 MPH’, ‘Reduce Speed NOW’ and ‘Police Speed Check’. He then measured the speed of each of the first 40 motorists encountering each sign.

Answers:

Is there a relationship between political affiliation and attitudes to the Millennium Dome? The appropriate test is the Chi-Squared Goodness of Fit. We have frequency data rather than a score per participant: all we know is how many people approve or disapprove of the Dome. This narrows down our choice of test to Chi-Squared. We have only one independent variable (political party), so it must be the Goodness of Fit version of Chi-Squared.

Does exposure to mis-spelt words affect spelling ability? The number of words spelt correctly is a ratio measure (you could have a score of zero if you were an atrocious speller!) Each participant gives us a single score. We have one independent variable, with two levels (exposure or non-exposure to misspellings). It’s an experimental design (we randomly allocate people to one condition or the other, and look for differences between the two groups as a consequence of our manipulation of the independent variable). It’s an independent-measures design, since each participant takes part in only one condition in the experiment. The appropriate test is therefore the independent-measures t-test.

Do rats make better pets than hamsters? Our data consist of attitude ratings. We have one independent variable, with two levels (type of animal: hamster or rat). It’s an experiment, given that we are looking for differences between ratings for rats and hamsters. It’s a repeated-measures design, since each participant is providing two scores, a rating for their pet rat and a rating for their hamster. The data are most likely to require a non-parametric test, firstly because ratings are measurements on an ordinal scale, and secondly because the ratings are likely to be skewed and hence not normally distributed (most people like their pets!) The appropriate test here is the Wilcoxon test.
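To make the Wilcoxon test a little more concrete, here is a rough sketch of how its statistic is usually obtained: rank the sizes of the non-zero differences between each person's two ratings, then add up the ranks separately for positive and negative differences. The ratings are invented for illustration (eight owners rather than twenty), and the function name `wilcoxon_t` is hypothetical. The sketch stops at the statistic and does not look up its significance.

```python
def wilcoxon_t(first, second):
    """Return (T, n): the smaller rank sum and the number of non-zero differences."""
    # Differences of zero carry no information about direction and are dropped.
    diffs = [a - b for a, b in zip(first, second) if a != b]
    magnitudes = sorted(abs(d) for d in diffs)

    def rank(size):
        # Tied magnitudes share the average of the ranks they occupy.
        return magnitudes.index(size) + (magnitudes.count(size) + 1) / 2

    positive = sum(rank(abs(d)) for d in diffs if d > 0)
    negative = sum(rank(abs(d)) for d in diffs if d < 0)
    return min(positive, negative), len(diffs)


# Hypothetical suitability ratings (1-7) from eight owners of both animals.
rat_ratings = [6, 5, 7, 4, 6, 3, 7, 5]
hamster_ratings = [4, 5, 6, 5, 3, 1, 4, 4]

t_value, n = wilcoxon_t(rat_ratings, hamster_ratings)
print(t_value, n)
```

A small value of T means that nearly all the differences point the same way; it would then be compared with the critical value for n pairs.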

Is there a relationship between belief in flying saucers and watching the ‘X-Files’? Don’t let the word ‘relationship’ fool you into thinking a correlation should be used here! We have frequency data (the number of people who do or do not believe in flying saucers, and the number who do or do not watch the ‘X-Files’). Thus we’re pretty well stuck with using Chi-Squared. We have two independent variables (belief in flying saucers; and ‘X-Files’ watching habits), so it must be the two-independent-variable version of Chi-Squared, the Chi-Squared Test of Association.
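As a rough illustration of what the Chi-Squared Test of Association does with such frequencies, the sketch below compares each observed count with the count expected if belief and viewing were unrelated. The split of 70 believers and 130 non-believers comes from the question; the viewing figures within each group are invented, and the function name `chi_squared_association` is hypothetical. No continuity correction is applied.

```python
def chi_squared_association(table):
    """Return (chi-squared, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: believers, non-believers. Columns: regular viewers, non-viewers.
# The viewing counts are hypothetical; only the row totals come from the text.
observed = [[50, 20],
            [40, 90]]

chi2, df = chi_squared_association(observed)
print(round(chi2, 2), df)
```

With one degree of freedom, a Chi-Squared value above 3.84 would be significant at the .05 level.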

Does shift-work affect cognitive performance? Performance on the mental arithmetic test is almost certainly going to be measured in terms of the number of questions correctly answered, which is a ratio measure. Each worker will give us a single score. We have one independent variable, with two levels (time of shift: night or day). It’s an experimental design. It’s an independent-measures design, since each participant takes part in only one condition in the experiment. We would have to check the data after we collected them to check that they satisfied the other requirements for a parametric test, i.e. homogeneity of variance and normality of distribution. If they did, we would use the independent-measures t-test. (If they didn’t, then we’d fall back on its non-parametric counterpart, the Mann-Whitney test.)
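For readers who like to see the arithmetic, here is a minimal sketch of the independent-measures t statistic using the pooled variance of the two groups. The scores are invented for illustration, and the function name `independent_t` is hypothetical. The sketch assumes the variances are similar, which is exactly the requirement the text says must be checked.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Return (t, degrees of freedom) using the pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical numbers of arithmetic questions answered correctly.
night_shift = [12, 15, 11, 14, 13]
day_shift = [16, 18, 15, 17, 19]

t_value, df = independent_t(night_shift, day_shift)
print(t_value, df)
```

The sign of t simply reflects which group was entered first; here a negative value means the night-shift mean is the lower of the two.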

Do working mothers differ from non-working mothers in terms of the quality of their interactions with their children? The number of positive actions initiated by the mother is a ratio measure. Each mother gives us a single score. There is one independent variable (mother’s status: working or non-working). It’s an experimental design, because we are looking for differences between the two groups of mothers. It’s an independent-measures design, since each mother is either working or non-working. So why not use an independent-measures t-test? Well, it says in the question that ‘the variances were rather different between the two groups’. This inhomogeneity of variance violates one of the requirements for using parametric tests such as the t-test, and means that we therefore have to use the t-test’s non-parametric brother, the Mann-Whitney test.
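The Mann-Whitney statistic can be sketched as a simple count: for every pairing of a score from one group with a score from the other, note which is larger. The figures below are invented (and deliberately much more spread out in one group than the other), and the function name `mann_whitney_u` is hypothetical. The sketch stops at the statistic and does not look up its significance.

```python
def mann_whitney_u(group_a, group_b):
    """Return U, the smaller of the two pairwise 'win' counts."""
    # Each pair scores 1 if the group_a value is larger, 0.5 if tied, 0 otherwise.
    u_a = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in group_a for b in group_b)
    # The two counts always add up to the total number of pairs.
    u_b = len(group_a) * len(group_b) - u_a
    return min(u_a, u_b)


# Hypothetical counts of positive actions initiated by each mother.
working = [8, 9, 10, 11, 12]
non_working = [2, 5, 9, 20, 31]

print(mann_whitney_u(working, non_working))
```

Because only the ordering of the scores matters, the very different spread of the two groups does not disturb the calculation in the way it would for a t-test.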

Are men’s pain thresholds affected by what they see on TV? The answer to this question depends on how pain thresholds were measured. We could do this in all sorts of ways (some of which are best left to the imagination!). Let’s suppose we measured the amount of time that each participant could keep their hand in freezing cold water (something which is painful but unlikely to cause significant tissue damage). This would be a ratio measure. We have one independent variable: type of video watched, with three levels (Noddy, Clint or Mr. Bean). It’s an experimental design, since we are looking for differences between these three conditions. It’s a repeated-measures design, since each participant is watching three videos and hence takes part in all conditions of the experiment. Assuming the data turn out to be normally-distributed and that the variance is similar in all three conditions, then the appropriate test is a repeated-measures Analysis of Variance. If these requirements weren’t met, or if the measurements had been participants’ pain ratings (and hence measured on an ordinal scale), we would have used the non-parametric equivalent of the repeated-measures ANOVA, Friedman’s test.

Does drinking ‘Diet Coke’ affect window-cleaners’ attractiveness? The appropriate test here is the Wilcoxon test. We have measurements on an ordinal scale (ratings of attractiveness). There is one independent variable, ‘type of drink consumed’ (with two levels: Diet Coke or no Diet Coke). This is an experimental design, since we are looking for differences between conditions, and we have repeated measures (since each woman is asked to give a rating for both of the conditions of the study). Since we have ordinal data, we should be looking for a non-parametric test: Wilcoxon is therefore the test to choose.

Is there a relationship between the amount of personal possessions one has, and happiness? Three hundred people are interviewed, and asked to provide estimates of how many possessions they have (where ‘possession’ is defined as a single article costing over £100) and how happy they are (on a seven-point scale). ‘Number of possessions’ is a measurement on a ratio scale, but ‘ratings of happiness’ is an ordinal-scale measurement. We have a correlational design, since we are looking for a relationship between possessions and happiness. This comes down to deciding whether we should use a parametric or a non-parametric correlation test. We have already noted that one of the variables (happiness) is measured on an ordinal scale: hence the data do not satisfy the requirements for Pearson’s r, and we should use Spearman’s rho.
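Spearman’s rho can be thought of as Pearson’s r worked out on ranks instead of raw scores, which is why it copes with ordinal data. The sketch below follows that route, giving tied scores the average of the ranks they occupy. The six pairs of scores are invented (the study has 300), and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards; tied values share their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def pearson_r(x, y):
    """Pearson's r for two equally long lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of the scores."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: number of possessions and happiness rating (1-7).
possessions = [3, 10, 25, 7, 40, 15]
happiness = [2, 4, 5, 4, 7, 3]

print(round(spearman_rho(possessions, happiness), 2))
```

Only the order of the happiness ratings is used, so the calculation makes no assumption that the gap between a rating of 2 and 3 equals the gap between 6 and 7.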

Do different types of road-sign affect speeding behaviour differently? Our dependent variable here is ‘speed’, which is a ratio measure. We have one independent variable: type of road sign (with four levels, corresponding to the four different types of sign). Different groups of motorists respond to each sign, so it’s an independent-measures design. There are no a priori grounds for thinking that the data won’t satisfy the requirements for a parametric test (although this is something to be checked after the data have been collected). The appropriate test is therefore a one-way independent-measures Analysis of Variance.
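The logic of a one-way independent-measures Analysis of Variance can be sketched in a few lines: compare the variation between the group means with the variation of scores within the groups. The speeds below are invented, with three motorists per sign rather than forty, and the function name `one_way_anova_f` is hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df between groups, df within groups)."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Hypothetical speeds (mph) for 'Slow', '30 MPH', 'Reduce Speed NOW'
# and 'Police Speed Check'.
speeds = [[34, 36, 35], [31, 33, 32], [30, 32, 31], [27, 29, 28]]

f_ratio, df_between, df_within = one_way_anova_f(speeds)
print(f_ratio, df_between, df_within)
```

A large F says only that the signs differ somewhere; it does not say which signs differ from which.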

A Further Practical Test

Suppose we want to find out if listening to Bedraggled’s singing shrinks your brain. Think about how you could turn this idea into a workable study. Design a study that would give results that could be analysed using (a) a correlation; (b) Chi-Squared; and (c) a parametric statistical test. Think about the advantages and disadvantages in each case.

Answer:

There is no one perfect answer to this question: however, here are some things that you might have considered when designing your studies.

(a) A study designed with a view to using a correlation on the results:

We could get a large group of pop music fans, and get two measures from each one: firstly, an estimate of how frequently they listen to Bedraggled’s singing, and secondly, a measure of their brain size (perhaps by taking an MRI scan of their heads). Both of these measures are ratio data; assuming that the data satisfy the other requirements for a parametric test (something we could check once we have obtained the data), we could perform a Pearson’s r on the results. A significant negative correlation between listening frequency and brain size would tell us that brain shrinkage was associated with listening to Bedraggled – the more you listened, the smaller your brain.

Advantages:

We can make use of naturally-occurring variations in people’s behaviour: no-one is forcing them to do something unnatural, like listening to Bedraggled if they don’t want to.

Disadvantages:

This has the usual disadvantage of a correlational study: it tells us nothing about causality. In theory, it is just as likely that brain shrinkage causes a person to listen to Bedraggled, as it is that listening to Bedraggled causes brain shrinkage. (Or there might be a third variable that gives rise to variations in the other two: for example, listening to Strides might cause brain damage and a desire to listen to Bedraggled.)

(b) A study designed with a view to using Chi-Squared on the results:

We could take a large group of people and divide them into four categories, based on the permutations of our two independent variables (Bedraggled-listening and brain size): frequent listeners to Bedraggled who had experienced brain shrinkage; frequent listeners to Bedraggled who had not experienced brain shrinkage; infrequent listeners who had shrunken brains; and infrequent listeners whose brains remained unshrunken. To do this, we would have to devise some criteria as a basis for allocating people to one category or another. So, for example, we might define ‘frequent listening to Bedraggled’ as listening for more than seven hours a week, and ‘brain shrinkage’ as having a brain that was 10% smaller than the norm for that person’s age. Our results would be the frequencies with which people fell into our four categories. We could perform a Chi-Squared Test of Association on the data: if Bedraggled listening was associated with brain shrinkage, there should be considerably more people in the category of ‘Bedraggled fans with small brains’ than in the other categories.

Advantages:

Again, it makes use of naturally-occurring variations in people’s behaviour, as in the correlational study.

Disadvantages:

It doesn’t give us much information. We have simply lumped people into listeners versus non-listeners, and brain-shrunk versus non-shrunk. In this particular case, our criteria for lumping are fairly arbitrary - someone else might have defined ‘frequent listening’ as anything over 10 minutes per week, or ‘brain shrinkage’ as having occurred only if the brain was 20% smaller than average. If we used different criteria, we might have obtained different results. In any case, we get only a crude idea of what the effects of listening to Bedraggled might be. The results might tell us that listening to Bedraggled for more than seven hours a week is associated with brain shrinkage, but it doesn’t tell us whether the effects of Bedraggled listening are cumulative (i.e. the more you listen, the more your brain shrinks) or all-or-none; and it still tells us nothing about the causal relationship between Bedraggled-listening and brain-shrinkage.

(c) A study designed with a view to using parametric statistics on the results:

We could do an experiment on this issue. We could take some participants, and randomly allocate them to one of two conditions: half get prolonged exposure to Bedraggled’s singing, and the other half are exposed to an equivalent amount and volume of harmless music (for example by Riff Clichard, whose music has been around long enough for it to be known that it poses no serious threat to health). After the exposure period is finished, we measure the brain-size of each participant. If Bedraggled’s music produces brain shrinkage, then participants in the Bedraggled-exposure condition should have smaller brains than participants in the Riff Clichard condition. ‘Brain shrinkage’ is a ratio measure, so as long as the data turn out to satisfy the other requirements for a parametric test, we could use an independent-measures t-test.

Advantages:

We can directly and unequivocally determine cause and effect: we manipulate Bedraggled-listening, and examine the effects of our manipulations on brain shrinkage. If brain shrinkage has occurred, it must be due to what we did to the participants, since they were randomly allocated to the two conditions of the study.

Disadvantages:

If we really did think that listening to Bedraggled caused brain damage, this would be a highly unethical experiment! That apart, we would be getting participants to listen to Bedraggled when this was not necessarily something they would do voluntarily. Consequently, if the study found evidence of brain damage as a result of listening to Bedraggled, we still would not know if shrinkage would occur in people who voluntarily exposed themselves to their music. (It might be that shrinkage has been caused by listening to Bedraggled in conjunction with the stress of not being able to turn them off!) Finally, we still know nothing about how much Bedraggled listening is required for brain shrinkage to occur. However, we could find this out by testing more experimental groups, each of which received a different duration of exposure to Bedraggled. (We would then analyse the data using a one-way independent-measures Analysis of Variance.)

Hopefully you have learnt from this example that (a) no one design is foolproof – each has its merits as well as its disadvantages; and (b) decisions made about statistics at the outset have major consequences for what you will be able to conclude from your results.

PART 3 WRITING UP YOUR RESEARCH

* * *

9 A Quick Guide to Writing a Psychology Lab-Report

* * *

The following chapters give a detailed explanation of how to write a psychology lab-report. If you are new to psychology, you might want to read this short chapter first, to give yourself an overview of what’s involved.

9.1 An Overview of the Various Sections of a Report

* * *

Lab-reports are modelled on the scientific journal article. Like them, the report is divided into sections, each of which provides a specific type of information. Here, we provide a short description of what should be contained in each section, followed in each case by a brief illustration from a wholly fictitious and potentially offensive study on national stereotypes. Chapter 16 shows a complete, and fuller, sample report: you might want to have a quick look at that as well, to give yourself a feel for what a lab-report should look like, before going on to read Chapters 10–15.
