How to Design and Report Experiments
Box 8.1: All you need to know about Chi-Squared Tests
The Chi-Squared test (χ²) is useful if – despite our best efforts at warning you against it – you have ended up with nominal (categorical) data, with each person contributing only once to each category (i.e. your data are between-subjects). If all you have is the frequency with which certain events have occurred, then Chi-Squared can be used to see whether these observed frequencies differ from those that would be expected by chance. Here's the formula for Chi-Squared:

χ² = Σ [ (O – E)² / E ]

where O is each observed frequency and E is its associated expected frequency.
Chi-Squared is thus the sum, across all the categories, of the squared difference between each observed frequency and its associated expected frequency, divided by that expected frequency. The bigger the value of χ², the greater the difference between observed and expected frequencies, and hence the more confident we can be that the observed frequencies have not occurred by chance. (In practice, you also need to take into account the 'degrees of freedom', which in the simplest case is the number of categories minus one.)
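To make the formula concrete, here is a minimal sketch in Python that applies it directly to a set of made-up observed and expected frequencies (the numbers are purely illustrative, not data from any study in this book):

```python
# Chi-squared by hand: the sum of (observed - expected)^2 / expected.
# The frequencies below are invented purely to illustrate the formula.
observed = [30, 14, 6]   # how many people actually fell into each category
expected = [20, 20, 10]  # how many we would expect under the null hypothesis

chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1   # degrees of freedom for the one-variable case

print(f"chi-squared = {chi_squared:.2f}, df = {df}")  # chi-squared = 8.40, df = 2
```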
Two common uses of Chi-Squared are set out below. More details on Chi-Squared (including how to work it out by hand) can be found in Siegel and Castellan (1988).
A: The one-independent-variable case: the χ² 'Goodness of Fit' test
This can be used to compare an observed frequency distribution to an expected frequency distribution. It's most often used when you have the observed frequencies for several mutually exclusive categories, and you want to decide if they have occurred equally frequently. Suppose, for example, a leading soap-powder manufacturer wanted us to find out which name was most attractive for their new washing powder. We could take 100 shoppers, present them with an array of five different names for soap powders, and ask them to choose the single name that they thought was most attractive. The frequencies with which the names were chosen would be our observed frequencies; if all five names were equally attractive, the expected frequency for each category would be 100/5 = 20. The degrees of freedom for this test will be k – 1, where k is the number of categories used (in this case there are 5 categories and so df = 4).
Number of shoppers picking each name:
χ² = 52.5, with 4 df, p < .001. It appears that the distribution of shoppers across soap-powder names is not random: some names get picked more than we would expect by chance, and some get picked less.
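If you would rather not grind through the formula by hand, here is a minimal sketch using scipy.stats.chisquare (assuming SciPy is available). The text does not print the actual table of choices, so the counts below are invented; they were picked only so that they sum to 100 and reproduce the χ² of 52.5 quoted above – many other sets of counts would do the same:

```python
from scipy.stats import chisquare

# Hypothetical counts of shoppers picking each of the five names.
# Invented for illustration: they sum to 100 and happen to reproduce
# the chi-squared value quoted in the text.
observed = [48, 6, 14, 15, 17]

# With five equally attractive names, each expected frequency is
# 100 / 5 = 20; chisquare() assumes a uniform distribution by default.
result = chisquare(observed)

print(f"chi-squared = {result.statistic:.1f}")  # 52.5, with df = 5 - 1 = 4
print(f"p = {result.pvalue:.2e}")               # p < .001
```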
B: Chi-Squared as a Test of Association between two independent variables
Another common use of χ² is to determine whether there is an association between two independent variables. For example, is there an association between gender (male or female) and choice of soap powder (Washo, Musty, etc.)? Here's a contingency table, containing the data for a random sample of 100 shoppers, 70 men and 30 women:
Number of shoppers picking each name:
These are our observed frequencies. Calculating the expected frequencies is a little more involved than in our previous example, but is based on similar logic: the expected frequency for each cell is its row total multiplied by its column total, divided by the overall total. The degrees of freedom for a contingency table are equal to (number of columns – 1) × (number of rows – 1); in this case we have 5 columns and 2 rows, so df = (5 – 1) × (2 – 1) = 4 × 1 = 4. Using the same formula as before, χ² = 52.94 with 4 df, p < .001.
Our observed frequencies are significantly different from the frequencies that we would expect to obtain if there were no association between the two variables. In other words, the pattern of name preferences is different for men and women.
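For this two-variable case, scipy.stats.chi2_contingency does all of the work, including computing the expected frequency for each cell as (row total × column total) / grand total. As before, the book's actual contingency table is not reproduced here, so the counts below are invented (70 men and 30 women, as in the text, but spread across the five names arbitrarily):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2 x 5 contingency table: rows are men (n = 70) and
# women (n = 30); columns are the five soap-powder names.
observed = [
    [30, 5, 10, 15, 10],  # men,   row total 70
    [2, 12,  6,  4,  6],  # women, row total 30
]

chi2, p, df, expected = chi2_contingency(observed)

# df = (columns - 1) * (rows - 1) = 4 * 1 = 4, as in the text.
# 'expected' holds row total * column total / grand total for each cell.
print(f"chi-squared = {chi2:.2f}, df = {df}, p = {p:.2e}")
```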
Problems in interpreting Chi-Squared
The Chi-Squared test merely tells you that some relationship exists between the two variables in question: it does not tell you what that relationship is and, most importantly, it does not tell you anything about the causal relationship between the two variables. In the present example, it would be tempting to interpret the results as showing that being male or female causes people to pick different soap-powder names. In this case, that's probably a reasonable assumption, since it's unlikely that the soap-powder names cause people to be male or female. In principle, however, the direction of causality could equally well go the other way. Chi-Squared merely tells you that the two variables are associated in some way: precisely what that association is, and what it means, is for you to decide.
Another problem in interpreting Chi-Squared is that the test tells you only that the observed frequencies differ from the expected frequencies in some way: it does not tell you where this difference comes from. Usually, but not always, you can get an idea of why the Chi-Squared was significant by looking at the size of the discrepancies between the observed and expected frequencies. In other cases, however, Chi-Squared contingency tables are not so easy to interpret!
Assumptions of the Chi-Squared test
For a Chi-Squared test to be used, the following assumptions must hold true:
Observations must be independent: each subject must contribute to one and only one category. Failure to observe this rule renders the Chi-Squared test results completely invalid.
Problems arise when the expected frequencies are very small: a common rule of thumb is that none of the expected frequencies should be less than 1, and no more than about 20 per cent of them should be less than 5. See Siegel and Castellan (1988) for details.
Why you should avoid Chi-Squared if you can
Once again, we stress that you should design your study so you can avoid using Chi-Squared! The main problem with obtaining frequency data is that they provide so little information about your participants’ performance. All you have is knowledge about which category someone falls into, and this is a very crude measure. Consider the examples in this box. All we have obtained from our participants is knowledge of which soap powder they liked most – nothing else. What a waste of time and effort! Had we obtained attractiveness ratings from them for each powder, we would be able to find out so much more. For example, by how much do the soap powders differ in attractiveness? Are people in close agreement about this, or are there large individual differences? How do the ratings by men compare to those by women? Do men and women show different patterns of ratings? Just a minor change to the design – obtaining scores rather than merely categorizing people – gives us a lot more information.
However, suppose we make some minor changes to our study. Instead of merely assigning participants to one of three categories according to whether they thought the new striping pattern was better, the same or worse, we now ask them to rate each of the two striping patterns separately, using a seven-point scale that runs from 'very inconspicuous' to 'highly conspicuous'. Now we have two scores for each participant (the number that corresponds to their rating of the new pattern, and the number that corresponds to their rating of the old pattern). Because each participant supplies both ratings, the scores are paired, so we can perform a Wilcoxon matched-pairs test on these data and see whether the ratings for the two patterns are significantly different. We've gone to the same amount of trouble as in the previous version of the study, using the same emergency vehicles and the same number of participants. However, by making just a small modification to the procedure, we have extracted a lot more useful information. As well as determining whether or not participants thought the two patterns differed in detectability, the data we obtained would permit us to work out descriptive statistics (e.g. the median rating for each of the patterns, and some measure of the range of responses) that would enable us to know to what extent the two patterns were considered different. We would also be able to get some idea of how responses to each of the patterns were distributed. We might find that everyone thought the new pattern was markedly better than the old one; or it might be that most, but by no means all, participants thought the new pattern was a moderate improvement over the old one. These are two quite different conclusions, but it would have been impossible to distinguish between them with the previous experimental design.
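Here is a minimal sketch of that analysis in Python, again assuming SciPy; the seven-point ratings below are invented, with each index representing one participant's pair of ratings:

```python
from scipy.stats import wilcoxon

# Hypothetical 7-point conspicuity ratings (1 = very inconspicuous,
# 7 = highly conspicuous). The lists are paired: the same participant
# provides the rating at each index. All values are invented.
new_pattern = [6, 5, 7, 6, 4, 6, 5, 7, 6, 5]
old_pattern = [4, 3, 5, 4, 3, 3, 4, 5, 3, 4]

# Wilcoxon matched-pairs signed-ranks test on the paired differences.
stat, p = wilcoxon(new_pattern, old_pattern)

print(f"W = {stat}, p = {p:.4f}")
```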
We could do better still. The previous study is based on people's subjective responses to the two patterns: we haven't really measured the detectability of the two patterns so much as people's opinions of their detectability (which isn't necessarily the same thing, because people may not have much insight into their cognitive abilities with respect to object recognition). So, we could do the same study, but this time measure how quickly people can detect the two different striping patterns. For example, we could have two vehicles, one with the new striping pattern and one with the old. Each participant does several trials. On a given trial, the participant sees an array of vehicles, and has to decide whether or not a striped one is present. We measure the time it takes to respond. Each participant provides two 'mean reaction time' scores: the average of their correct 'present' responses to the vehicle with the new striping pattern, and the average of their correct 'present' responses to the vehicle with the old pattern.
At first sight, this doesn’t look like much of an improvement over the previous version of our study. However, there are actually quite a few benefits to designing the study this way. First (as long as the assumption is correct that detectability is reflected in reaction times), we have a more direct, less subjective measure of detectability: we haven’t asked participants if they think that one pattern is more conspicuous than the other, we have got them to demonstrate this directly, in their behaviour (see Chapter 2). Second, because we have obtained data which satisfy the requirements of a ‘parametric’ test, we can use more powerful statistical tests on our data – more powerful in the sense that they are more likely to uncover any differences between our experimental conditions than the non-parametric test we used in the previous version of this study (see below for more on this, and have a look at the discussion of statistical power in Chapter 3). We could do a repeated-measures t-test on these reaction-time (RT) data, to see if the average RT to the new pattern was faster or slower than the average RT to the old pattern.
Finally, we could design the study so that we would be able to perform more sophisticated statistics on the data, and ask more sophisticated questions as a result. Suppose we had obtained reasonably similar numbers of male and female participants, and that within each sex, we had similar numbers of young, middle-aged and elderly participants. So far, we have been confined to answering the question ‘do people perceive one of the patterns as more detectable than the other?’ Now, we could include ‘gender’ and ‘age’ as additional independent variables in our data analysis, and answer more complex questions such as: ‘do men and women differ in their perceptions of the two striping patterns?’; ‘do the effects of the striping patterns differ according to the age of the perceiver?’; and even ‘is the detectability of the striping pattern affected by the participant’s sex, age or any combination of these?’ For example, it might be the case (I can’t think why, mind you!) that there is less difference in detectability between the two patterns for elderly men than there is for young females.
You could answer these questions by comparing the frequencies with which the different age-groups and genders detected the patterns (using Chi-Squared (see Box 8.1), or its more sophisticated big brother, log-linear analysis). However, again, a minor change in our design would reap large benefits. If we obtained a score per participant (e.g. each participant’s reaction time to detect the pattern) we could use parametric statistical tests with this design, tests which are better suited to detecting the interactions between variables that provide the answers to interesting questions like these (see Sections 6.8, 6.9 and 6.10 for more information on these tests). This version of the study would involve little more work than the crude one outlined at the beginning of this example, but would enable us to find out much more about people’s responses to striping patterns. Hopefully, this will convince you that giving a little thought at the design stage to what you are going to measure can pay off handsomely in terms of what you will be able to conclude from the results obtained.
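One way to run such an analysis in Python is with statsmodels (assuming it and pandas are installed). The sketch below keeps things simple by computing one 'detectability advantage' score per participant (old-pattern RT minus new-pattern RT) and then testing the sex × age interaction with a two-way between-subjects ANOVA; all of the data are invented:

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical data: one detectability advantage score per participant
# (old-pattern RT minus new-pattern RT, in ms), plus sex and age group.
data = pd.DataFrame({
    "advantage": [48, 30, 45, 12, 35, 25, 40, 10, 50, 28, 42, 15],
    "sex": ["m", "m", "m", "m", "m", "m",
            "f", "f", "f", "f", "f", "f"],
    "age": ["young", "middle", "young", "elderly", "middle", "elderly",
            "young", "elderly", "young", "middle", "middle", "elderly"],
})

# Two-way between-subjects ANOVA: main effects of sex and age, plus
# their interaction, on the detectability advantage.
model = ols("advantage ~ C(sex) * C(age)", data=data).fit()
print(anova_lm(model, typ=2))
```

In a real study you would want far more than two participants per cell, and you might prefer a full mixed-design ANOVA that keeps 'pattern' as a repeated-measures factor rather than collapsing it into a difference score.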
8.2 Five Questions to Ask Yourself
Why am I here? What's the purpose of life? Is there a God? Perhaps a more pressing question is which statistical test to use. Making this decision is fairly straightforward as long as you tackle the task systematically. All you need to do is ask yourself five questions about the nature of the study and the type of data that you have obtained. (You do have to understand the questions; however, the necessary concepts are explained in this book, and summarized below, so you have no excuse for not picking the correct test!)
Question 1: what kind of data will I collect?
Question 2: how many independent variables will I use?
Question 3: what kind of design will I use (experimental or correlational)?
Question 4: will my study use a repeated-measures or independent-measures design?
Question 5: will my data be parametric or non-parametric?
If you can answer these five questions correctly, and follow the flowchart in Box 8.2, you should be able to work out which test would be most appropriate for your data. Let’s take the questions in turn.
Question 1: What Kind of Data Will I Collect – Frequencies or Scores?
What sort of data are you obtaining from each participant? Does each participant give you one or more scores? Or do they merely contribute to a frequency, in the sense that the data consist of how many participants fall into each of several categories? (We’ve been at pains throughout this book to explain how you should avoid getting frequency data if you can possibly do so, because you end up being stuck with Chi-Squared as your main way of analysing the data. So if you have ended up with frequency data nevertheless, don’t complain to us!) Here’s an example, to make the distinction clear. Suppose a study looked at whether different European countries had different car accident rates. The data would consist of the frequency of car accidents in each country: we would know nothing about our individual participants except that they had experienced an accident and therefore had contributed to their country’s accident total. The data are thus the frequency of occurrence of each of several categories (‘British car accidents’, ‘French car accidents’, etc.). If, on the other hand, we took a sample of drivers from each country and recorded each individual driver’s accident rate, our data would now consist of scores – a score per participant, which would be that individual’s personal accident rate.
If you have frequency data, you are pretty much stuck with doing some variant of Chi-Squared. Follow the flowchart to decide which version of Chi-Squared you should use. If there is just one independent variable in your study, then the 'Chi-Squared Goodness of Fit' test is the version to use. If you have two independent variables, use the 'Chi-Squared Test of Association'. Confused? Here's an example to make the distinction clearer. Suppose we were interested in how people got to work: here, we have one independent variable, 'mode of transport'. There could be lots of different categories of this – 'bus', 'motorcycle', 'skateboard', etc. – but they are all instances of 'mode of transport'. So, in this case, the 'Chi-Squared Goodness of Fit' test is the one to go for. Now suppose we were also interested in sex differences in how people got to work: now we have a second independent variable in our study, 'sex of traveller'. We would be interested in asking questions like 'do men and women use different modes of transport to get to work?' In other words, we have two independent variables and we are interested in whether there is an association between them. In this case, the 'Chi-Squared Test of Association' is the correct version of Chi-Squared to use; a sketch of both versions, applied to this transport example, follows below.
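The sketch below shows how the choice between the two versions maps onto two different SciPy functions; all of the counts are invented for illustration:

```python
from scipy.stats import chisquare, chi2_contingency

# One independent variable ('mode of transport'): Goodness of Fit test.
# Invented counts of commuters choosing each mode.
transport = [40, 25, 20, 15]            # bus, motorcycle, skateboard, walking
print(chisquare(transport))             # are all modes equally popular?

# Two independent variables ('mode of transport' x 'sex of traveller'):
# Test of Association on a contingency table. Counts again invented.
table = [
    [25, 10, 12, 8],   # men
    [15, 15,  8, 7],   # women
]
chi2, p, df, expected = chi2_contingency(table)
print(f"chi-squared = {chi2:.2f}, df = {df}, p = {p:.3f}")
```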
If your data consist of scores, then you need to work out the answers to the remaining questions below.
What kind of scale are your data measured on? It’s important to know what kind of data you are dealing with, as certain statistics can only be used validly with certain kinds of data. (See Chapter 1 for a detailed discussion of these issues: they are mentioned briefly again here to jog your memory and save you having to hold the book open at two places at once.)
Nominal (categorical) data: With this kind of data, numbers are used only as names for categories. They are therefore not really acting as numbers, but just serving as labels.
Ordinal (rank) data: On an ordinal scale, things can be ranked or ordered in terms of some property such as size, length, speed or time. However, successive points on the scale are not necessarily spaced equally apart.
Interval data: These are measured on a scale on which measurements are spaced at equal intervals, but on which there is no true zero point (although there may be a point on the scale which is arbitrarily named 'zero'). Temperature in degrees Celsius is a familiar example: the intervals between degrees are equal, but 0°C does not represent an absence of temperature.
Ratio data: These are the same as interval data, except that there is a true zero on the scale, representing a complete absence of the thing which is being measured. With this kind of scale, the intervals are equally spaced and we can make meaningful statements about the ratios of quantities measured: a reaction time of 400 ms, for example, really is twice as long as one of 200 ms.
The level of measurement is something that is wholly under your control and known in advance of carrying out the study. There are other aspects of the data that are also important, but since these are normally discovered once the data are collected, rather than known in advance, we will consider them separately in answering Question 5 below.
Question 2: How Many Independent Variables Will I Use?
In Chapter 2 we saw that an independent variable is something that is manipulated by you, the experimenter (as opposed to a dependent variable, which is something that you measure). There are two types of independent variable. In the first type, you are free to choose which values of the independent variable to use. So, for example, suppose your independent variable was the time delay between stimulus presentation and a memory test. You are free to use whatever levels of this independent variable you like: you might have delays of 1 day, 2 days and 3 days, or delays of 1 hour, 1 week and 1 month. It's entirely up to you. The same goes for 'age of participant': you could pick any age-categories that you want.