I Think You'll Find It's a Bit More Complicated Than That

by Ben Goldacre

Guns don’t kill people, puppies do. The world is a really big place.

Datamining for Terrorists Would Be Lovely If It Worked

Guardian, 28 February 2009

This week Sir David Omand, the former Whitehall Security and Intelligence Co-ordinator, described how the state should analyse data about individuals in order to find terrorist suspects: travel information, tax and phone records, emails, and so on. ‘Finding out other people’s secrets is going to involve breaking everyday moral rules,’ he said, because we’ll need to screen everyone to find the small number of suspects.

There is one very significant issue that will always make datamining unworkable when used to search for terrorist suspects in a general population, and that is what we might call the ‘baseline problem’: even with the most brilliantly accurate test imaginable, your risk of false positives increases to unworkably high levels, as the outcome you are trying to predict becomes rarer in the population you are examining. This stuff is tricky but important. If you pay attention you will understand it.

Let’s imagine you have an amazingly accurate test, and each time you use it on a true suspect, it will correctly identify them as such eight times out of ten (but miss them two times out of ten); and each time you use it on an innocent person, it will correctly identify them as innocent nine times out of ten, but incorrectly identify them as a suspect one time out of ten.

These numbers tell you about the chances of a test result being accurate, given the status of the individual, which you already know (and the numbers are a stable property of the test). But you stand at the other end of the telescope: you have the result of a test, and you want to use that to work out the status of the individual. That depends entirely on how many suspects there are in the population being tested.

If you have ten people, and you know that one of them is a suspect, and you assess them all with this test, then you will correctly get your one true positive and – on average – one false positive. If you have a hundred people, and you know that one is a suspect, you will get your one true positive and, on average, ten false positives. If you’re looking for one suspect among a thousand people, you will get your suspect, and a hundred false positives. Once your false positives begin to dwarf your true positives, a positive result from the test becomes pretty unhelpful.
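If you would rather watch the arithmetic run than take it on trust, here is a minimal Python sketch of that paragraph. The function name `expected_screening_results` and its parameters are invented for illustration; the 80 per cent and 90 per cent figures are simply the imaginary test described above. It works out exact expectations, so the results are fractional rather than the rounded whole numbers used in the text.

```python
def expected_screening_results(population, suspects, sensitivity=0.8, specificity=0.9):
    """Expected outcomes when everyone in `population` is screened once.

    sensitivity: chance a true suspect is flagged (eight in ten here).
    specificity: chance an innocent person is cleared (nine in ten here).
    """
    innocents = population - suspects
    true_positives = suspects * sensitivity
    false_positives = innocents * (1 - specificity)
    flagged = true_positives + false_positives
    # Of everyone the test flags, what share are real suspects?
    share_real = true_positives / flagged
    return true_positives, false_positives, share_real


for population in (10, 100, 1000):
    tp, fp, share_real = expected_screening_results(population, suspects=1)
    print(f"{population:>5} people: {tp:.1f} true positives, "
          f"{fp:.1f} false positives, "
          f"{share_real:.1%} of flagged people are real suspects")
```

The expected false positives come out at 0.9, 9.9 and 99.9, which the text rounds to one, ten and a hundred. The last column is the number that matters: it shrinks as the suspect becomes rarer, even though the test itself has not changed.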

Remember this is a screening tool, for assessing dodgy behaviour, spotting dodgy patterns, in a general population. We are invited to accept that everybody’s data will be surveyed and processed, because MI5 has clever algorithms to identify people who were never previously suspected. There are 60 million people in the UK, with, let’s say, 10,000 true suspects. Using your unrealistically accurate imaginary screening test, you get 6 million false positives. At the same time, of your 10,000 true suspects, you miss 2,000.

If you raise the bar on any test, to increase what statisticians call the ‘specificity’, and thus make it less prone to false positives, then you also make it much less sensitive, so you start missing even more of your true suspects (remember, you’re already missing two in ten of them).

Or do you just want an even more stupidly accurate imaginary test, without sacrificing true positives? It won’t get you far. Let’s say you incorrectly identify an innocent person as a suspect one time in a hundred: you still get 600,000 false positives, out of the UK population. One time in a thousand? Come on. Even with these unfeasibly accurate imaginary tests, when you screen a general population as proposed, it is hard to imagine a point where the false positives are usefully low, and the true positives are not missed. And our imaginary test really was ridiculously good: it’s a very difficult job to identify suspects just from slightly abnormal patterns in the normal things that everybody does.
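The same sums can be run for the whole country. This is only a sketch under the assumptions already stated in the text (60 million people, a guessed 10,000 true suspects, a test that catches eight suspects in ten); the constant and function names are made up for the example.

```python
UK_POPULATION = 60_000_000
TRUE_SUSPECTS = 10_000   # the "let's say" figure from the text, not a real estimate
SENSITIVITY = 0.8        # the test still misses two suspects in ten


def uk_screening(false_positive_rate):
    """Expected false positives, suspects caught and suspects missed."""
    innocents = UK_POPULATION - TRUE_SUSPECTS
    false_positives = innocents * false_positive_rate
    caught = TRUE_SUSPECTS * SENSITIVITY
    missed = TRUE_SUSPECTS - caught
    return false_positives, caught, missed


for one_in in (10, 100, 1000):
    false_positives, caught, missed = uk_screening(1 / one_in)
    print(f"wrong about 1 innocent in {one_in:>4}: "
          f"{false_positives:>10,.0f} innocent people flagged, "
          f"{caught:,.0f} suspects caught, {missed:,.0f} missed")
```

Even at one in a thousand, the sketch still flags roughly 60,000 innocent people to catch 8,000 suspects, and that is before making the test any less sensitive.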

Then it gets worse. These suspects are undercover operatives: they’re trying to hide from you, they know you’re datamining, so they will go out of their way to produce trails which can confuse you.

And lastly, there’s the problem of validating your algorithms, and calibrating your detection systems. To do that, you need training data: 10,000 people where you know for definite if they are suspects or not, to compare your test results against. It’s hard to picture how that can be done.

I’m not saying you shouldn’t spy on everyday people; obviously I have a view, but I’m happy to leave the morality and the politics to those less nerdy than me. I’m just giving you the maths on specificity, sensitivity and false positives.

Benford’s Law: Using Stats to Bust an Entire Nation for Naughtiness

Guardian, 17 September 2011

This week we might bust an entire nation for handing over dodgy economic statistics. But first: why would they bother? Well, it turns out that whole countries have an interest in distorting their accounts, just like companies and individuals. If you’re a Euro member like Greece, for example, you have to comply with various economic criteria, and there’s the risk of sanctions if you miss them.

Government figures are subjected to various forms of audit already, of course, but alongside checking that things marry up with each other, forensic statisticians also have a few interesting tricks to try to spot suspicious patterns in the raw numbers, and so estimate the chances that figures from a set of accounts have been tampered with. One of the cleverest tools is something called Benford’s law.

Imagine you have the data on, say, the population of every country in the world. Now, take only the ‘leading digit’ from each number: the first number in the number, if you like. For the UK population, which was 61,838,154 in 2009, that leading digit would be six. Andorra’s was 85,168, so that’s eight. And so on.

If you take all those leading digits, from all the countries, then overall you might naïvely expect to see the same number of ones, fours, nines and so on. But in fact, for naturally occurring data, you get more ones than twos, more twos than threes, and so on, all the way down to nine. This is Benford’s law: the distribution of leading digits follows a logarithmic distribution, so you get a ‘one’ most commonly, appearing as first digit around 30 per cent of the time, and a nine as first digit only 5 per cent of the time.
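The text doesn't spell out the formula, but the usual statement of Benford's law is that a leading digit d turns up with probability log10(1 + 1/d). A short Python sketch (the function name is just a label for this example) reproduces the 30 per cent and 5 per cent figures:

```python
import math


def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1 to 9)."""
    return math.log10(1 + 1 / digit)


for digit in range(1, 10):
    print(f"leading digit {digit}: {benford_expected(digit):.1%}")
```

The nine shares add up to exactly one, because the terms log10((d + 1)/d) telescope to log10(10).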

The next time you’re waiting for a bus, you can think about why this happens (bear in mind what leading digits do when quantities repeatedly double, perhaps). Reality agrees with this theory pretty neatly, and if you go to the website testingbenfordslaw.com you’ll see the proportions of each leading digit from lots of real-world datasets, graphed alongside what Benford’s law predicts they should be, with data ranging from Twitter users’ follower counts to the number of books in different libraries across the US.
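If the bus is late, one way to poke at the doubling hint is to start at 1, keep doubling, and tally the leading digits as you go. This is a toy illustration rather than a proof, and the choice of a thousand doublings is arbitrary.

```python
from collections import Counter

DOUBLINGS = 1000  # arbitrary; more doublings give a smoother picture


def leading_digit(n):
    """First digit of a positive whole number."""
    return int(str(n)[0])


counts = Counter()
value = 1
for _ in range(DOUBLINGS):
    counts[leading_digit(value)] += 1
    value *= 2  # Python integers never overflow, so this stays exact

for digit in range(1, 10):
    print(f"leading digit {digit}: {counts[digit] / DOUBLINGS:.1%}")
```

A doubling quantity spends a long time crawling from 1 to 2 (it has to grow by 100 per cent) and very little getting from 9 to 10 (about 11 per cent), which is why the ones pile up.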

Benford’s law doesn’t work perfectly: it only works when you’re examining groups of numbers that span several orders of magnitude. So, for example, for the age in years of the graduate working population, which goes from around twenty to seventy, it wouldn’t be much good; but for personal savings, from nothing to millions, it should work fine. And of course Benford’s law works in other counting systems, so if three-fingered sloths ever developed numeracy, and counted in base 6, or maybe base 12, the law would still hold.

This property of naturally occurring data has been used to check for dubious behaviour in figures for four decades now: it was first used on socioeconomic data submitted to support planning applications, and then on company accounts; it’s even admissible in US courts. In 2009 an economist from the Bundesbank suggested using Benford’s law on countries’ economic data, and last month the results were published (hat-tip to Tim Harford for the paper).

Researchers took macroeconomic data on all twenty-seven EU nations, looking specifically at the accounting data that countries have to hand over for monitoring, which is all posted for free at the online repository Eurostat: things like government deficit, debt, revenue, expenditure, and so on. Then they took just the first digits from all the numbers, and checked to see if they deviated from what you would predict using Benford’s law.
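The text doesn't say which measure of deviation the researchers used, so what follows is only a sketch of one common choice, a chi-squared statistic comparing observed leading digits with the Benford proportions. Both datasets are invented for the example and are not Eurostat figures; the function names are hypothetical too.

```python
import math
from collections import Counter

# Standard table value for chi-squared with 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_8DF_5PCT = 15.51


def leading_digit(x):
    """First significant digit of a non-zero number."""
    return int(f"{abs(x):e}"[0])  # scientific notation puts that digit first


def benford_chi_squared(values):
    """Chi-squared distance between observed leading digits and Benford's law."""
    digits = [leading_digit(v) for v in values if v != 0]
    observed = Counter(digits)
    n = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (observed[d] - expected) ** 2 / expected
    return statistic


# Two made-up datasets, NOT real government accounts.
steady_growth = [1000 * 1.05 ** k for k in range(200)]  # spans about four orders of magnitude
flat_digits = list(range(100, 1000))                    # every leading digit equally common

for name, data in (("steady growth", steady_growth), ("flat digits", flat_digits)):
    stat = benford_chi_squared(data)
    verdict = "suspicious" if stat > CHI2_CRITICAL_8DF_5PCT else "unremarkable"
    print(f"{name}: chi-squared = {stat:.1f} ({verdict})")
```

A large statistic only says that the digits look unusual. It is a prompt to go and audit the figures, not proof that anybody cooked them.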

The results were fun. Greece – whose economy has tanked – showed the largest and most suspicious deviation from Benford’s law of any country in the Euro.

This isn’t a massive surprise: the EU has run several investigations into Greece’s numbers already, and the ones from 2005 to 2008 were repeatedly revised upwards after the fact. But it’s neat, and if you wanted to while away a very nerdy afternoon, you could even download the data, for free, from Eurostat, and repeat the analysis for yourself. Joy!

The Certainty of Chance

Guardian, 6 September 2008

Britain’s happiest places have been mapped by scientists, according to the BBC: Edinburgh is the most miserable place in the country, and they were overbrimming with technical details on exactly how miserable we are in each area of Britain. The story struck a chord, and was lifted by journalists throughout the nation, as we cheerfully castigated ourselves. ‘Misera-Poole?’ asked the Dorset Echo. ‘No Smiles in Donny’, said Doncaster Today.

From the Bromley Times, through Bexley, Dartford and Gravesham, to the Hampshire Chronicle, everyone was keen to analyse and explain their ranking. ‘Basingstoke lacks any sense of community or heart,’ said Reverend Dr Derek Overfield, industrial chaplain for the area. And so on.

Exactly what kind of data is the good reverend explaining there? The Times had some methodological information: ‘Researchers at Sheffield and Manchester universities based their findings on more than 5,000 responses from the annual British Household Panel Survey.’ According to the BBC it was presented in a lecture at some geographical society. ‘However,’ it said quietly, ‘the researchers stress that the variations between different places in Britain are not statistically significant.’

Here, nestled away, halfway through the gushing barrage of data and facts, was an unmarked confession: this entire news story was based on nothing more than random variation.

There are many reasons why you might see differences between different areas in your survey data on how miserable people are, and people being differently miserable is only one explanation. There might also be, of course, the play of chance: 5,000 people in 274 areas doesn’t give you many in each town – fewer than twenty, in fact – so you might just happen to have picked out more miserable people in Edinburgh, and miss the fact that misery is uniformly distributed throughout the country.
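You can watch the play of chance manufacture a league table. The sketch below spreads 5,000 imaginary respondents over 274 areas and gives every one of them the same chance of being miserable. The 30 per cent misery rate is an assumption picked out of the air, and so is the seed.

```python
import random

rng = random.Random(2008)  # fixed seed so the run is repeatable
AREAS = 274
RESPONDENTS = 5000
TRUE_MISERY_RATE = 0.3     # assumed, and identical in every area

# Deal respondents out as evenly as possible: 18 or 19 per area.
area_sizes = [RESPONDENTS // AREAS + (1 if i < RESPONDENTS % AREAS else 0)
              for i in range(AREAS)]

misery_rates = []
for size in area_sizes:
    miserable = sum(rng.random() < TRUE_MISERY_RATE for _ in range(size))
    misery_rates.append(miserable / size)

print(f"people per area: {min(area_sizes)} to {max(area_sizes)}")
print(f"'happiest' area:       {min(misery_rates):.0%} miserable")
print(f"'most miserable' area: {max(misery_rates):.0%} miserable")
```

Every area in this simulation is equally miserable by construction, yet the sample still hands you a top and a bottom of the table, ready for a local headline.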

This is called sampling error, and it quietly undermines almost every piece of survey data ever covered in any newspaper. Although the phenomenon has spawned a fiendish area of applied maths called ‘statistics’, the basic principles are best understood with a simple game.

Dr W. Edwards Deming was a charismatic American management guru who railed against performance-related pay on the grounds that it arbitrarily rewarded luck. Working in a theatrical field, he demonstrated his ideas with a simple piece of stagecraft he called the Red Bead Experiment.

Deming would appear at management conferences with a big trough containing thousands of beads which were mostly white, but 20 per cent were red. Eight volunteers were then invited up on stage from the audience of management drones: three to be managers, and five to be workers. ‘Your job,’ Deming explained solemnly, ‘is to make white beads.’

He then produced a paddle with fifty holes cut into it, which was passed to each ‘worker’ in turn. They dipped the paddle into the trough, wiggled it around, and tried to produce as many white beads as they could manage through this entirely random process.

‘Go and show the inspectors,’ Deming would say sternly.

‘Only five red beads, well done! Fourteen red beads? I think we need to re-evaluate your skill set.’ Workers were sacked, promoted, retrained and redeployed, to great amusement.
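The stagecraft is easy to reproduce in a few lines of Python. This is a sketch under assumptions: the text only says the trough held 'thousands' of beads, so 4,000 is a guess, and the worker names are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
PADDLE_HOLES = 50

# Assumed trough of 4,000 beads, 20 per cent of them red.
trough = ["red"] * 800 + ["white"] * 3200

red_counts = {}
for worker in ("Alice", "Bob", "Carol", "Dave", "Erin"):  # hypothetical names
    scoop = rng.sample(trough, PADDLE_HOLES)  # one dip of the paddle
    red_counts[worker] = scoop.count("red")

for worker, reds in red_counts.items():
    print(f"{worker}: {reds} red beads out of {PADDLE_HOLES}")
```

On average a paddle comes up with ten red beads, but individual scoops wander on either side of that, and nothing a worker does can change it.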

We ignore basic principles like sampling error at our peril, because the illusion of control, which we all carry around for the sake of sanity, is more powerful than we think, and countless workers have had their lives turned to misery for the simple crime of pulling out fifteen red beads.

Back in the world of misery, were the journalists blameless, and guilty only of ignorance? For any individual, nobody can tell. But Dr Dimitris Ballas, the academic who did the research, has a clue: ‘I tried to explain issues of significance to the journalists who interviewed me. Most,’ he says, ‘did not want to know.’

Sampling Error, the Unspoken Issue Behind Small Number Changes in the News

Guardian, 20 August 2011

What do all these numbers mean? ‘“Worrying” Jobless Rise Needs Urgent Action – Labour’ was the BBC headline. It explained the problem in its own words: ‘The number of people out of work rose by 38,000 to 2.49 million in the three months to June, official figures show.’

There are dozens of different ways to quantify the jobs market, and I’m not going to summarise them all here. The claimant count and the labour force survey are commonly used, and the number of hours worked is informative too: you can fight among yourselves over which is best, and get distracted by party politics to your hearts’ content. But in claiming that this figure for the number of people out of work has risen, the BBC is simply wrong.

Here’s why. The ‘Labour Market’ figures come through the Office for National Statistics, and it has published the latest numbers in a PDF document. In the top table, fourth row, you will find the figures the BBC is citing. Unemployment aged sixteen and above is at 2,494,000, and has risen by 38,000 over the past quarter (and by 32,000 over the past year). But you will also see some other figures, after the symbol ‘±’, in a column marked ‘sampling variability of change’.

Those figures are called ‘95 per cent confidence intervals’, and these are among the most useful inventions of modern life.

We can’t do a full census of everyone in the population every time we want some data, because that would be too expensive and time-consuming for monthly data collection. Instead, we take what we hope is a representative sample.

This can fail in two interesting ways. Firstly, you’ll be familiar with the idea that a sample can be systematically unrepresentative: if you want to know about the health of the population as a whole, but you survey people in a GP waiting room, then you’re an idiot.

But a sample can also be unrepresentative simply by chance, through something called sampling error. This is not caused by idiocy. Imagine a large bubblegum-vending machine, containing thousands of blue and yellow bubblegum balls. You know that exactly 40 per cent of those balls are yellow. When you take a sample of a hundred balls, you might get forty yellow ones, but in fact, as you intuitively know already, sometimes you will get thirty-two, sometimes forty-eight, or thirty-seven, or forty-three, or whatever. This is sampling error.

Now, normally, you’re at the other end of the telescope. You take your sample of a hundred balls, but you don’t know the true proportion of yellow balls in the jar – you’re trying to estimate that – so you calculate a 95 per cent confidence interval around whatever proportion of yellow you get in your sample of a hundred balls, using a formula (in this case, 1.96 × the square root of ((0.6 × 0.4) ÷ 100)).
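That formula is the standard normal-approximation interval for a proportion, and it is short enough to type in. A minimal sketch, with a made-up function name, for a sample of a hundred balls in which 40 per cent came out yellow:

```python
import math


def margin_of_error_95(sample_proportion, sample_size):
    """Half-width of the 95% interval, using the normal approximation."""
    spread = sample_proportion * (1 - sample_proportion) / sample_size
    return 1.96 * math.sqrt(spread)


sample_proportion = 0.4
sample_size = 100
margin = margin_of_error_95(sample_proportion, sample_size)

print(f"estimate: {sample_proportion:.0%}, give or take {margin:.1%}")
print(f"interval: {sample_proportion - margin:.1%} to {sample_proportion + margin:.1%}")
```

The margin works out at about 9.6 percentage points, so a sample of a hundred can only pin the true proportion down to somewhere between roughly 30 and 50 per cent.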

What does this mean? Strictly (it still makes my head hurt), it means that if you repeatedly took samples of a hundred, then on 95 per cent of those attempts, the true proportion in the bubblegum jar would lie somewhere between the upper and lower limits of the 95 per cent confidence intervals of your samples. That’s all we can say.
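If that definition makes your head hurt too, you can simply try it. The sketch below draws thousands of samples of a hundred from an imaginary machine that is exactly 40 per cent yellow, builds an interval around each one, and counts how often the interval catches the truth. It treats the machine as effectively bottomless, which is an assumption, and the seed and number of repeats are arbitrary.

```python
import math
import random

rng = random.Random(2011)  # fixed seed so the run is repeatable
TRUE_YELLOW = 0.4
SAMPLE_SIZE = 100
REPEATS = 5000

covered = 0
for _ in range(REPEATS):
    # Draw one sample of a hundred balls and count the yellow ones.
    yellow = sum(rng.random() < TRUE_YELLOW for _ in range(SAMPLE_SIZE))
    p_hat = yellow / SAMPLE_SIZE
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / SAMPLE_SIZE)
    # Does this sample's interval contain the true proportion?
    if p_hat - margin <= TRUE_YELLOW <= p_hat + margin:
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of the intervals contained the true 40 per cent")
```

The answer lands close to 95 per cent, which is all the name promises: the guarantee belongs to the procedure over many repeats, not to any single interval.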

So, if we look at these employment figures, you can see that the changes reported are clearly not statistically significant: the estimated change over the past quarter is 38,000, but the 95 per cent confidence interval is ±87,000, running from –49,000 to 125,000. That wide range clearly includes zero, which means it’s perfectly likely that there’s been no change at all. The annual change is 32,000, but again, that’s ±111,000.
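The check being applied here is nothing more than 'does the interval straddle zero?'. A small sketch using the figures quoted above; the function names are invented for the example.

```python
def change_interval(estimate, margin):
    """Lower and upper limits of estimate ± margin."""
    return estimate - margin, estimate + margin


def includes_zero(estimate, margin):
    """True when 'no change at all' sits inside the interval."""
    lower, upper = change_interval(estimate, margin)
    return lower <= 0 <= upper


reported_changes = [
    ("quarterly change", 38_000, 87_000),
    ("annual change", 32_000, 111_000),
]

for label, estimate, margin in reported_changes:
    lower, upper = change_interval(estimate, margin)
    note = "includes zero" if includes_zero(estimate, margin) else "excludes zero"
    print(f"{label}: {estimate:+,} (from {lower:+,} to {upper:+,}), {note}")
```

Both intervals include zero, so neither figure can tell a rise from a fall from nothing happening at all.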

I don’t know what’s happening to the economy – it’s probably not great. But these specific numbers are being over-interpreted, and there is an equally important problem arising from that, which is frankly more enduring for meaningful political engagement.

We are barraged, every day, with a vast quantity of numerical data, presented with absolute certainty and fetishistic precision. In reality, many of these numbers amount to nothing more than statistical noise, the gentle static fuzz of random variation and sampling error, making figures drift up and down, following no pattern at all, like the changing roll of a dice. This, I confidently predict, will never change.

Scientific Proof That We Live in a Warmer and More Caring Universe

Guardian, 29 November 2008

As usual, it’s not Watergate, it’s just slightly irritating. ‘Down’s births increase in a caring Britain’, said The Times: ‘More babies are being born with Down’s syndrome as parents feel increasingly that society is a more welcoming place for children with the condition.’ That’s beautiful. ‘More mothers are choosing to keep their babies when diagnosed with Down’s Syndrome’ said the Mail. ‘Parents appear to be more willing to bring a child with Down’s syndrome into the world because British society has become increasingly accepting of the genetic abnormality’ said the Independent. ‘Children’s quality of life is better and acceptance has risen’, said the Mirror.
