In fact, that was Hunter’s estimate of savings if you applied a simple test of general mental ability, a psychometric intelligence test, instead of nothing at all. What if you chose to interview the people rather than give them a psychometric intelligence test? You would lose $11,640,000,000 of the 15 billion dollars. You’d lose over 8 billion dollars of the 15 billion if you used only reference checks. Hunter concluded that not using a simple general mental ability test in hiring could cost up to about 20% of the USA’s total federal budget in productivity losses. So, we might conclude that good hiring of the best potential workers can make a sizeable difference. Let’s look at where Hunter got these figures.
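To see roughly where dollar figures of this kind come from, here is a minimal sketch of the standard utility calculation used in personnel psychology (the Brogden–Cronbach–Gleser approach): the gain from a selection method is the number hired, times the method’s validity, times the dollar value of one standard deviation of job performance, times how selective the hiring was. Every numerical input below is invented for illustration; these are not Hunter’s actual figures.

```python
# A rough sketch of the utility arithmetic behind estimates like Hunter's.
# The gain from a selection method is: hires x validity x dollar value of one
# SD of job performance x average standardized selection score of the hires.
# All numbers below are illustrative assumptions, not Hunter's inputs.

def utility_gain(n_hired, validity, sd_performance_dollars, mean_z_of_hires,
                 cost_per_applicant, n_applicants):
    """Dollar gain from one year of hires made with a given selection method.

    validity               -- correlation between selection score and job performance
    sd_performance_dollars -- dollar value of one SD of job performance
    mean_z_of_hires        -- average standardized selection score of those hired
    """
    benefit = n_hired * validity * sd_performance_dollars * mean_z_of_hires
    return benefit - cost_per_applicant * n_applicants

# Compare a general mental ability test with an unstructured interview for a
# hypothetical employer; the two validities are assumptions for illustration.
common = dict(n_hired=10_000, sd_performance_dollars=12_000,
              mean_z_of_hires=0.8, cost_per_applicant=30, n_applicants=40_000)
test_gain = utility_gain(validity=0.51, **common)
interview_gain = utility_gain(validity=0.38, **common)
print(f"Extra productivity from using the test instead: "
      f"${test_gain - interview_gain:,.0f} per year")
```

With inputs like these, a difference in validity of 0.13 is worth roughly twelve million dollars a year to a single large employer; scale that up to a national workforce and figures in the billions stop looking mysterious.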
Hunter and his colleagues have made a speciality of something called meta-analysis. What that means is that they tend not to do individual research studies themselves. Instead, they systematically search through the scientific literature for all of the research studies ever done on a topic and they try to put them together to come to a coherent, quantitative conclusion. The area that they meta-analysed is job-hiring decision-making. They have pored through studies conducted over 85 years of psychological research. They have read and filleted thousands of studies to form their conclusions. They have compiled a comprehensive guide to what is best in selecting for job performance. Though their research papers can be quite technical and bristling with statistics, they have a strong and simple message. Hiring decisions matter: they can make or lose you a lot of money. And there is nothing more important in hiring than having something, some set of open and fair criteria for selection, that relates as highly as possible to how well the person will do the job. That’s the key then: what are the best ways of selecting people to do a job well?
In 1998 Hunter published a long paper with Frank Schmidt in the American Psychological Association’s top review journal, the Psychological Bulletin. In it they examined the relative predictive power of 19 different ways of selecting people for jobs. Everything from interviews, through intelligence testing, and having people try out the job, to examining the applicants’ handwriting (a popular method in France and Israel especially). There’s a selected summary of these results in Figure 24; the diagram represents the cumulative knowledge from almost a century of research and thousands of research studies.
24. Some factors that predict job performance. The longer the column, the better the prediction. The numbers are correlation coefficients.

Look at Figure 24. Each of the columns represents a different way of hiring people (selection methods). The length of the column represents the size of the correlation between people’s rankings on each selection method and their later performance on the job. The longer the column, the stronger is this relation, and the better is the method of selection. The longest column belongs to work sample tests. This is the situation where you can get all of the applicants actually to do the job for a time and assess how efficient they are. These are costly to set up and they are far from universally applicable; by no means the majority of jobs lend themselves to this type of procedure. Note, too, that highly structured employment interviews do relatively well, but the more typical unstructured interviews are poorer. Reference checks on their own are not especially helpful. Years of job experience and years of education do not offer much information that’s going to predict people’s performance in doing the job. Age is totally uninformative and shouldn’t be used as a selection criterion; neither should graphology, the analysis of handwriting. It tells you nothing about how well the person will do the job – and yet it is used widely in some countries for job selection. Not only does selecting by this method lose money through sub-optimal hiring decisions; the cost of having it done is wasted too. And it is unfair: it ends up rejecting people for something that is entirely unrelated to their ability to do the job.
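If it is not obvious why a bigger correlation is worth money, a quick simulation makes the point. The sketch below draws applicants whose selection score and later job performance are correlated at r, hires the top fifth on the selection score, and reports how well the hires go on to perform; the particular r values are chosen purely for illustration.

```python
# A minimal simulation of what the correlations in Figure 24 mean in practice.
# Applicants' selection scores and later job performance are drawn from a
# bivariate normal distribution with correlation r; we hire the top 20% on the
# selection score and look at the hires' average performance.
import numpy as np

rng = np.random.default_rng(0)

def mean_performance_of_hires(r, n_applicants=100_000, hire_fraction=0.2):
    cov = [[1.0, r], [r, 1.0]]
    scores, performance = rng.multivariate_normal([0, 0], cov, size=n_applicants).T
    cutoff = np.quantile(scores, 1 - hire_fraction)
    return performance[scores >= cutoff].mean()  # in SD units of job performance

for r in (0.0, 0.2, 0.5):   # no information, weak predictor, strong predictor
    print(f"validity r = {r:.1f}: hires perform "
          f"{mean_performance_of_hires(r):+.2f} SD above the applicant average")
```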
In Figure 24 you can see that the column for the general intelligence/psychometric test is comparatively long, almost as good as the best predictors of job performance. It does offer some useful information about how well people, on average, will do the job in many types of employment. Unlike other selection methods, it can be applied near-universally. It can be given for jobs where it is not possible to do a job tryout or compose a highly structured interview. For example, work sample tests can only be done by people who know how to do the job in the first place. Compared to most other methods, the general mental ability test is quick, cheap, and convenient. It has the lowest cost of any of the relatively good methods. Looking over the research literature, there is far, far more evidence for the success of the general mental ability test than any other method of selection. It’s been used in many more research studies than any other method.
Tests of general intelligence have other merits in the job selection process. They are the best predictors of which employees will learn most as they progress on the job. They are the best predictors of who will benefit most from training programmes. However, the power of the general intelligence test to predict job success is not equal for all types of job. The more professional and mentally complex the job, the more successfully the mental test score will predict success in it. Mental tests therefore do poorest in totally unskilled jobs and are much better at predicting success in professional and skilled jobs. In their research report Schmidt and Hunter concluded that:
Because of its special status, GMA [tests of general mental ability or general intelligence] can be considered the primary personnel measure for hiring decisions, and one can consider the remaining 18 as supplements to GMA measures.
What they meant was that you’d be well advised to use a test of general intelligence in most job-hiring situations, because such tests are cheap, quick, almost universally applicable, and modestly informative. But there’s an obvious question that follows on from that. If we add some of the other hiring methods to a general mental ability test, which will add the most power to our hiring decisions? So Hunter looked at those methods which added the highest extra amounts of predictive power, assuming that we have already used a general intelligence test. The best was an integrity test, which added another 27% to the predictive power. A work sample or a structured interview would each add 24% extra predictive power. Where these could be applied, then, it would be sensible to add one or more of them to the general mental test. Using multiple methods is sensible in these cases, because it leads to even better decisions. Tests of conscientiousness and reference checks are also helpful additions to the general mental ability test.
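Where might a figure like ‘27% extra predictive power’ come from? One way to read it is as the increase in the multiple correlation when a second predictor is added to the first. The sketch below uses the textbook two-predictor formula; the correlations fed into it are assumptions chosen for illustration, not figures quoted from Schmidt and Hunter’s paper.

```python
# A sketch of where a figure like '27% extra predictive power' can come from:
# the multiple correlation of two predictors with job performance.  The input
# correlations below are assumptions for illustration, not quoted from the paper.
import math

def multiple_R(r1, r2, r12):
    """Multiple correlation of the criterion with two predictors.

    r1, r2 -- each predictor's correlation with job performance
    r12    -- correlation between the two predictors
    """
    return math.sqrt((r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2))

r_gma = 0.51        # general mental ability test alone (assumed)
r_integrity = 0.41  # integrity test alone (assumed)
r_between = 0.0     # assume the two predictors are roughly uncorrelated

combined = multiple_R(r_gma, r_integrity, r_between)
print(f"GMA alone: {r_gma:.2f}, GMA + integrity test: {combined:.2f}")
print(f"Gain in predictive power: {100 * (combined / r_gma - 1):.0f}%")
```

With roughly uncorrelated predictors of this size, the combined validity comes out just under 30% higher than the mental test alone, in the same region as the gains quoted above.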
In this setting then – finding a bunch of people who will do a range of jobs better than just taking people at random – an intelligence test has utility. No, it will not predict all that strongly how well people do a job. Yes, you will still hire people who are hopeless and with whom you can’t get on. But, on the whole, you’d be better off including a general mental ability test in your portfolio of selection methods.
In order to avoid an accusation of gross over-simplification, let me repeat that we all know it takes more than brains to be successful, and sometimes it hardly takes brains at all. Returning to Sir Walter Scott’s Kenilworth, we can see that the young Walter Raleigh, as he addressed some older and less successful courtiers, knew that he could progress far beyond them, given the possession of other qualities.
‘Why, sirs,’ answered the youth [Raleigh], ‘ye are like goodly land, which bears no crop because it is not quickened by manure; but I have that rising spirit in me which will make my poor faculties labour to keep pace with it. My ambition will keep my brain at work, I warrant thee.’
To follow this area up …
These papers are technical along the way, but the discursive sections are written with laudable clarity. These authors make their forceful conclusions lucidly, having first assembled frighteningly large bodies of evidence. If the latter paper were not so new I should have no hesitation in calling these papers ‘classic’ works in psychology.
Hunter, J. E. & R. F. Hunter (1984). Validity and utility of alternative predictors of job performance. Psychological Bulletin, 96, 72–98.
Schmidt, F. L. & J. E. Hunter (1998). The validity and utility of selection methods in personnel psychology: practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–74.
More applications of intelligence testing in education and the workplace are summarized in the American Psychological Association’s Task Force report that is covered in Chapter 7. I strongly recommend you read that. If you are interested in the origins of the first mental tests by Binet in France and their subsequent export to (and over-use in) the USA, then the most comprehensive and fair book I have read on this topic is the following.
Zenderland, L. (1998). Measuring Minds: Henry Herbert Goddard and the Origins of American Intelligence Testing. Cambridge: Cambridge University Press.
For details of research papers my colleagues and I have produced over the last 15 years or so, visit my website at http://129.215.50.40/Staff/staff/ijd/pubs_complete.html. A substantial proportion of these research reports used mental ability tests in medical settings, to discover whether some medical conditions and some medical treatments damage or enhance human intelligence test scores. That type of research is nowhere gathered together as a single body; it was not possible to describe it in the setting of a meta-analysis in the way that I was able to do because of the Hunters’ research in job selection.
As an example of how mental ability tests play a leading role in some medical issues, here’s an editorial article I wrote with a colleague in 1996 for the British Medical Journal. This is also available on the British Medical Journal website, which has free access (http://www.bmj.com/cgi/content/full/313/7060/767).
Deary, I. J. & B. M. Frier (1996). Severe hypoglycaemia in diabetes: Do repeated episodes cause cognitive decrements? British Medical Journal, 313, 767–8.
So far in this chapter I have emphasized the practical uses of intelligence tests for the users of the tests: the business person who wants to hire the best staff, the doctor who wants to know about her patients’ mental capacities, and so forth. Another angle on the utility of tests is to ask what they mean to you: that is, what are the betting odds on life outcomes given a certain level of intelligence? Linda Gottfredson’s chapter explains that intelligence tests are not testing abstruse, academic abilities, and that they relate to important outcomes across the whole range of life’s domains. The Bell Curve (discussed in more detail in Chapter 7) is also worth a look.
Gottfredson, L. (2000). g: Highly general and highly practical. In R. J. Sternberg & E. L. Grigorenko (eds), The General Intelligence Factor: How General Is It? New York: Lawrence Erlbaum.
Herrnstein, R. J. & C. Murray (1994). The Bell Curve. New York: Free Press.
Chapter 6 The lands of the rising IQ
Is intelligence increasing generation after generation?
If my score on an IQ-type test is higher than yours, then does it mean that I am brighter/cleverer than you? If the test used was one of the best indicators of the general intelligence factor, or if it was one of the more comprehensive test batteries, such as one of the Wechsler tests, then we might be persuaded provisionally to accept that conclusion and ask for more information. We might be further persuaded if we were genetically related and lived in a similar culture. The next dataset calls the mental ability testing enterprise into question by demonstrating large differences in mental test scores in just those situations where we might expect similarity. The key researcher involved is James Flynn, a political scientist working at the University of Otago, New Zealand. He has provided researchers in the field of human intelligence with a scientific conundrum and massive communal headache.
The first thing Flynn brought to serious scientific scrutiny was that mental test companies had to renorm their scores every so often. This rather boring-sounding, technical problem was the source of one of the largest unexplained puzzles in the field of intelligence research today. When you buy a mental test from a psychometric company, you get the test questions and the answers, and you get instructions for giving the test in a standard way so that everyone who takes the test gets an equal chance to score well on it. But, imagine that you have now tested someone on the test: you will realize that you need something else. The person’s score does not mean anything unless you have some indication of what is a bad, good, and indifferent score. Thus, with the test, when you buy it, you will get a booklet of normative scores, or ‘norms’. This is a series of tables which indicate how any given score fits into the population’s scores. Usually they are divided by age, because some test scores change with age (Chapter 2). Therefore, you can find out how your testee did when compared with their age peers. Usually the tables will tell you what percentage of the population would have scored better and worse than the score that your testee obtained. Those of us with children who have measured their heights and compared them with the population average for their ages will be familiar with this type of referencing.
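For the curious, a norms booklet is, in effect, no more than a lookup table. Here is a toy version: given an age band and a raw score, it reports what percentage of the norming sample the testee has beaten. Every number in the table is invented purely for illustration.

```python
# A toy version of a norms booklet: given a testee's age band and raw score,
# report the percentage of the norming sample scoring below them.  The raw
# scores and percentiles here are invented purely for illustration.
from bisect import bisect_right

# For each age band: raw scores at the 10th, 25th, 50th, 75th and 90th percentiles.
NORMS = {
    "20-29": [(22, 10), (28, 25), (34, 50), (40, 75), (46, 90)],
    "60-69": [(16, 10), (21, 25), (27, 50), (33, 75), (39, 90)],
}

def percentile(age_band, raw_score):
    table = NORMS[age_band]
    cut_scores = [score for score, _ in table]
    i = bisect_right(cut_scores, raw_score)
    return 5 if i == 0 else table[i - 1][1]   # below the lowest tabled score

# The same raw score means something different at different ages.
for band in NORMS:
    print(f"raw score 34, age {band}: better than about {percentile(band, 34)}% of peers")
```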
James Flynn noticed that these tables of norms had to be changed every several years. As new generations came along they were scoring too well on the tests. The tests seemed to be getting easier. A generation or two after the companies produced the tables of normative scores, the ‘average’ person of the later generation was scoring way above the ‘average’ person of the earlier generation. For example, 20-somethings tested in the 1980s were doing better on the same test than 20-somethings from the 1950s. The norms were becoming outdated – ‘obsolete’ was Flynn’s term. (There’s an ironic parallel with the trend in A-level results in England. Children have been scoring better than they used to on these tests, with resulting arguments about whether the teaching is better or the examinations are getting easier. At least, in the case of IQ-type tests, the content has remained the same.)
The response of the test companies was to ‘renorm’ the tests. The norms tables were altered so that, as time went on, it became harder to achieve a score that got you above any given percentage of your peers. Thus, if you scored the exact same test score on the exact same test in, say, 1950 and 1970, you would have a higher IQ in 1950 than in 1970. In fact, it can be seen as worse than that. Let’s say you take the test on the last day that the institution testing you used the old norms. You take the test and you obtain a given score. The tester looks up the norms tables and states that you make the cut above some percentage of your age peers. If you took the same test on the first day of the new norms, the same score would place you above a significantly smaller percentage of the population. In fact, the test companies would not always alter the norms tables. The other manoeuvre they adopted was to make the test harder, so that you had to take a new, harder test to get to the same point on the population scale.
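The effect of renorming on a ‘deviation IQ’ is easy to show. An IQ simply re-expresses the raw score against the norming sample’s mean and standard deviation (IQ = 100 + 15z), so when a newer, better-scoring sample supplies the mean, the same raw score earns a lower IQ. The means in the sketch below are invented for illustration.

```python
# Why renorming lowers the IQ attached to the same raw score: a deviation IQ
# re-expresses the raw score against the norming sample's mean and SD
# (IQ = 100 + 15z).  The norming values below are invented to show the effect.

def deviation_iq(raw_score, norm_mean, norm_sd):
    z = (raw_score - norm_mean) / norm_sd
    return 100 + 15 * z

raw = 40
old_norms = dict(norm_mean=34, norm_sd=8)   # earlier norming sample (illustrative)
new_norms = dict(norm_mean=38, norm_sd=8)   # later, better-scoring sample (illustrative)

print(f"IQ against the old norms: {deviation_iq(raw, **old_norms):.0f}")
print(f"IQ against the new norms: {deviation_iq(raw, **new_norms):.0f}")
```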
In summary, as the 20th century progressed, the whole population’s scores on some well-known mental tests were improving when compared with same-age people generations earlier. Just as average height has increased over generations, people began to wonder if intelligence was rising.
Flynn published a scientific paper in Psychological Bulletin in 1984 that sounded an alarm for IQ test-users about a potential disaster. ‘Everyone knew’ that tests had to be renormed every so often, but Flynn quantified the effect and spelled out its consequences in a smart piece of psychological detective work. He searched for every study he could find in which groups of people were given two different IQ tests whose norms were collected at least 6 years apart. This is the key idea. Flynn set about asking: what would the sample’s estimated IQs be when compared with the earlier and the later norms? For clarity, he decided to look exclusively at samples of white Americans. He found 73 studies, involving a total of 7,500 people, aged from 2 to 48 years. These studies involved the Stanford–Binet and the Wechsler scales, tests at the very centre of the intelligence testing world.
Flynn found that subjects’ estimated IQs were higher when they were compared with older norms, by contrast with more recent ones. On perusing all the samples involved, it became clear that the effect was fairly constant over the period from 1932 to 1978. During that time white Americans gained over 0.3 of an IQ point every year, about 14 IQ points over the epoch. So, over the middle part of the 20th century, the American IQ rose by a massive amount. Flynn warned us:
If two Stanford–Binet or Wechsler tests were normed at different times, the later test can easily be 5 or 10 points more difficult than the earlier, and any researcher who has assumed the tests were of equivalent difficulty will have gone astray. (p. 39)
Allowing for obsolescence in intelligence testing is just as essential as allowing for inflation in economic analysis. (p. 44)
This takes some reckoning with and becomes even more surprising when the trend in SAT scores is added to the picture. The Scholastic Aptitude Test (SAT) is a high-level test taken at the end of school by America’s educational elite. It is well documented that, over the period in which IQ scores were rising, the verbal scores – call it general knowledge for now – on the SAT were declining. And SAT scores and IQ scores are very highly correlated: yet one is decreasing over time while the other increases. If the IQ increases over time reflect a real rise in intelligence, and the SAT decreases are real decrements in knowledge, then one is forced to conclude that that aspect of the SAT that does not depend on intelligence (remember, IQ and SAT are highly correlated) must have gone down. Something that determines SAT scores (but not intelligence level) must have suffered massively at the same time that IQ went up. As Flynn worried: