by Ben Goldacre
The economic loss is almost impossible to measure: if any of the projects I’ve already described sound trivial to you, remember that this is a crippled field, where innovators have barely had a chance to get their eyes in. Amazing things happen when you pull individual pieces of information together into larger linked datasets: meaning emerges, as you produce facts from figures. If you’ve ever wished you were born in the nineteenth century, when there were so many obvious inventions and ideas to hook for yourself, then I seriously recommend you become a coder, because future nerds will look back on this time with the exact same envy. But that leap forward will be tediously retarded if we don’t make the government allow us to use the pavements.
Care.data Can Save Lives: But Not If We Bungle It
Guardian, 21 February 2014
Everything would be much simpler if science really was ‘just another kind of religion’. But medical knowledge doesn’t appear out of nowhere, and there is no ancient text to guide us. Instead, we learn how to save lives by studying huge datasets on the medical histories of millions of people. This information helps us identify the causes of cancer and heart disease; it helps us spot side effects from beneficial treatments, and switch patients to the safest drugs; it helps us spot failing hospitals, or rubbish surgeons; and it helps us spot the areas of greatest need in the NHS. Numbers in medicine are not an abstract academic game: they are made of flesh and blood, and they show us how to prevent unnecessary pain, suffering and death.
All this vital work is now being put at risk by the bungled implementation of the care.data project. It was supposed to link all NHS data about all patients together into one giant database, like the one we already have for hospital episodes; instead it has been put on hold for six months, in the face of plummeting public support. It should have been a breeze. But we have seen arrogant paternalism, crass boasts about commercial profits, a lack of clear governance, and a failure to communicate basic science properly. All this has left the field open for wild conspiracy theories. It would take very little to fix this mess, but time is short, and lives are at stake.
The care.data project was promoted in two ways: we will use your data for lifesaving research; and we will give it to the private sector for commercial exploitation, creating billions for the UK economy. This marriage was a clear mistake: by and large, the public support public research, but are nervous about commercial exploitation of their health data.
Now the teams behind care.data are trying to row back, explaining that access will only be granted for research that benefits NHS patients. That is laudable, but it is potentially a very broad notion. It’s one we would want to unpack, with clear, worked examples of the kind of things that would be permitted, and the kind of things that would be refused. But that’s not possible because, bizarrely, the specific principles, guidelines, committees and regulations that will determine all these decisions have not yet been clearly set out. This poses several difficulties. Firstly, the public are being asked to support something that feels intuitively scary, about the privacy of their medical records, without being told the details of how it will work. Secondly, the field has been left open to conspiracy theories, which are hard to refute without concrete guidance on how permissions for access really will work.
That said, many criticisms have been absurd. There has been endless discussion around the idea of health insurers buying health records, for example, and using them to reject high-risk patients. Call an insurer right now and see how you get on: within minutes you will be asked to declare your full medical history, waive confidentiality and grant access to your full medical notes anyway.
Many have complained about drug companies getting access to data, and this is more complex. On the one hand, arrangements like these are long-standing and essential: if medicines regulators get a few unusual side-effect reports from patients, they go to the drug company and force them to do a big study, examining – for example – 10,000 patients’ records, to find out if people on that drug really do have more heart attacks than we’d expect. To do this, the UK health regulator itself sells industry the data, historically through something called the General Practice Research Database, which already holds millions of people’s records. This needs to happen, and it’s good.
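To make that ‘more than we’d expect’ concrete, here is a minimal sketch of the observed-versus-expected arithmetic such a study turns on. Every number below is invented for illustration; a real pharmacoepidemiology analysis would adjust for age, sex and a dozen confounders.

```python
# A toy sketch of the observed-vs-expected arithmetic behind a
# post-marketing safety study. All numbers are invented for
# illustration only.

background_rate = 0.004      # assumed heart attacks per person-year in comparable patients
patients_on_drug = 10_000    # size of the cohort whose records are examined
years_followed = 2           # assumed follow-up per patient
observed_events = 110        # heart attacks actually found in the records (invented)

expected_events = background_rate * patients_on_drug * years_followed
ratio = observed_events / expected_events

print(f"Expected: {expected_events:.0f}, observed: {observed_events}")
print(f"Observed/expected ratio: {ratio:.2f}")
# A ratio well above 1 is what would prompt deeper investigation of the drug.
```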
But equally, people know – I’ve certainly shouted about it for long enough – that pharmaceutical companies also misuse data: they hide the results of clinical trials when it suits them, quite legally; they monitor individual doctors’ prescribing patterns to guide their marketing efforts, and so on. The public don’t trust the pharmaceutical industry unconditionally, and they’re right not to.
Trust, of course, is key here, and that’s currently in short supply. The NSA leaks showed us that governments were casually helping themselves to our private data. They also showed us that leaks are hard to control, because the National Security Agency of the wealthiest country in the world was unable to stop one young contractor stealing thousands of its most highly sensitive and embarrassing documents.
But there is a more specific reason why it is hard to give the team behind care.data our blind faith: they have been caught red-handed giving false reassurance on the very real – albeit modest – privacy threats posed by the system.
Tim Kelsey is the man running the show: an ex-journalist, passionate and engaging, he has drunk more open-data Kool-Aid than anyone I’ve ever met. He has evangelised the commercial benefits of sharing NHS data – perhaps because he made millions from setting up a hospital-ranking website with Dr Foster Intelligence – but he is also admirably evangelical about the power of data and transparency to spot problems and drive up standards. Unfortunately, he gets carried away, stepping up and announcing boldly that no identifiable patient data will leave the Health and Social Care Information Centre. Others supporting the scheme have done the same.
This is false reassurance, and that is poison in medicine, or in any field where you are trying to earn public trust. The data will be ‘pseudonymised’ before it is released to any applicant company, with postcodes, names and birthdays removed. But re-identifying you from that data is more than possible. Here’s one example: I had twins last year (it’s great; it’s also partly why I’ve been writing less). There are 12,000 dads with similar luck each year; let’s say 2,000 in London; let’s say one hundred of those are aged thirty-nine, like me. From my brief online bio you can work out that I moved from Oxford to London in about 1995. Congratulations: you’ve now uniquely identified my health record, without using my name, postcode, or anything ‘identifiable’. Now you’ve found the rows of data that describe my contacts with health services, you can also find out if I have any medical problems that some might consider embarrassing: incontinence, perhaps, or mental health difficulties. Then you can use that information to try to smear me: a routine occurrence if you do the work I do, whether it’s done by big drug companies or dreary little quacks.
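To make that narrowing-down concrete, here is a toy sketch in Python. The records and field names are invented, standing in for a ‘pseudonymised’ release: no names, no postcodes, and yet stacking up publicly knowable facts does the work.

```python
# A toy sketch of the re-identification described above, on invented
# records. None of these fields would count as 'identifiable' on its
# own; the point is what happens when you combine them.

records = [
    {"pseudonym": "a9f3", "dad_age": 39, "twins_2013": True,  "oxford_to_london_1995": True},
    {"pseudonym": "b2c7", "dad_age": 39, "twins_2013": True,  "oxford_to_london_1995": False},
    {"pseudonym": "c815", "dad_age": 52, "twins_2013": False, "oxford_to_london_1995": True},
    # ... imagine 60 million rows here
]

# Each fact acts as a filter, and each filter shrinks the pool:
# ~12,000 new fathers of twins, ~100 aged thirty-nine in London,
# one who moved from Oxford in about 1995.
pool = [r for r in records if r["twins_2013"]]
pool = [r for r in pool if r["dad_age"] == 39]
pool = [r for r in pool if r["oxford_to_london_1995"]]

print(len(pool))             # 1: a unique match, no name or postcode needed
print(pool[0]["pseudonym"])  # the key that unlocks the rest of the health record
```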
This risk isn’t necessarily big, but to say it doesn’t exist is crass: it’s false reassurance, which ultimately undermines trust. It’s also unnecessary and counterproductive, like hiding information on side effects instead of discussing them proportionately. To the best of my knowledge, we’ve never yet had a serious data leak from a medical research database, and there are plenty around already; but then, we are standing on the verge of a significant increase in the number of people accessing and using medical data. There are steps we can take to minimise the risks: only release a subset of the 60 million UK population to each applicant; only give out the smallest possible amount of information on each patient whose records you are sharing; suggest that people come to your data centre to run their analyses, instead of downloading records, and so on. But, while the care.data project might be planning to do some of those things, the ground rules haven’t been properly written out yet.
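For illustration, here is a toy sketch of two of those safeguards – row sampling and field minimisation – in Python. The field names, sampling fraction and helper function are all invented for the example, not drawn from any actual care.data specification.

```python
import random

# A toy sketch of two data-minimisation safeguards: release only a
# random subset of patients, and only the minimum fields an applicant
# needs. All names and values below are invented.

full_database = [
    {"pseudonym": f"p{i}", "year_of_birth": 1950 + i % 50,
     "diagnosis": "ischaemic heart disease", "postcode_district": "SE5"}
    for i in range(1000)
]

requested_fields = {"year_of_birth", "diagnosis"}   # the applicant's approved fields

def minimal_release(db, fields, fraction=0.1, seed=42):
    """Return a row-sampled, column-restricted extract rather than the lot."""
    rng = random.Random(seed)
    subset = rng.sample(db, int(len(db) * fraction))
    return [{k: v for k, v in row.items() if k in fields} for row in subset]

extract = minimal_release(full_database, requested_fields)
print(len(extract), extract[0])   # 100 rows, each stripped to two fields
```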
In any case, even safeguards such as these can be worked around. There are companies out there operating in the grey areas of the law, aggregating data from every source and leak they can find, generating huge, linked datasets with information from direct-marketing lists, online purchases, mobile-phone companies and more. Who’s to know if someone will start quietly aggregating all the small chunks of our health data?
This, of course, would be illegal. As Tim Kelsey and others are keen to point out, re-identifying or leaking data in any way would be a ‘criminal offence’. But as this project lands, we’re all becoming rapidly aware that incompetence, malice and creepiness around confidential data are policed with a worryingly light touch. Private investigators have little trouble obtaining confidential data from staff in the police force, banks and tax offices, for example.
Here’s why: it took a long time for anyone to realise that Steve Tennison, a finance manager in a GP practice, had accessed patients’ records on 2,023 occasions over the course of a year, although it was relevant to his work on only three occasions. The majority of the records he snooped on belonged to young women: he repeatedly accessed the records of one woman he had gone to school with, and also the records of her son. The maximum penalty for this is a fine, with a ceiling of £5,000 in magistrates’ courts. Tennison was fined £996 in December 2013. This is why the public feel nervous, and this is what we need to fix.
It’s painful for me to write critically about a project like care.data, because I love medical data, and I know the good it can do. We have a golden opportunity in the UK, with 60 million people cared for in one glorious NHS. Opt-outs would destroy the data, and the growing calls for an opt-in system would be worse: opt-in killed people by holding back organ donation, and more than that, it would exacerbate social inequality around data, because the poorest patients, who are the most likely to be unwell, are also the least engaged with services, the least likely to opt in. They would become invisible.
So here’s my advice: if you’re thinking of opting out – wait. If you run care.data – listen. There are three things the government can do to rescue this project.
Firstly, make a proper announcement about what you will do during the six-month delay. You cannot rely on blind trust when it comes to sharing private medical records, so explain that you’ll be coming back soon with a clear story. Sort out the governance framework, present unambiguous rules and principles explaining how data will be shared, list the specific clinical codes you’re proposing to upload, then give real-world examples of the kind of access applications that would be approved, and the kind that would be rejected. This is fair, and sensible.
Secondly, show the public how lives are saved by medical research. This needs examples from the vast archives of medical research on cancer, heart disease and more. Alongside that, give a clear nod to the small risks, and an explanation of how they will be mitigated. Never be seen to give false reassurance on these risks; if you do, you will lose patients’ trust forever.
Lastly, we need stiff penalties for infringing medical privacy, on a grand and sadistic scale. Fines, like parking tickets, are useless for individuals and companies: anyone leaking or misusing personal medical data needs a prison sentence, as does their CEO. Their company – and all its subsidiaries – should be banned from accessing medical data for a decade. Rush some test cases through, and hang the bodies in the town square.
If the government does all this, it has a good chance of saving a vital data project, and permitting medical research that saves vast numbers of lives to continue. If the government tries to fudge – with half measures, superficial PR and false reassurance – then care.data will fail, and it might well bring down other sensible public health research with it. Lives are at stake. This cannot be left to the last minute in the six-month pause, and time is precious. It’s February. If you’re thinking of opting out, please don’t. But mark your diary for May.
Care.data Has Been Bungled
Guardian, 28 February 2014
I am embarrassed. Last week I wrote in support of the government’s plans to collect and share the medical records of all patients in the NHS, albeit with massive caveats. The research opportunities are huge, but we already knew that the implementation was chaotic, with poor public information, partly because the checks and balances on who gets access to data – and how – have not yet been devised or implemented. When you’re proposing to share our most private medical records, vague promises and an imaginary regulatory framework are not reassuring.
Now it’s worse. On Monday, the Health and Social Care Information Centre admitted giving the insurance industry the coded hospital records of millions of patients, pseudonymised, but re-identifiable by anyone with malicious intent, as I explained last week. These were crunched by actuaries into tables showing the likelihood of death, depending on various features such as age or disease, to help inform insurance premiums.
We can reasonably disagree on whether you find this use of your medical records acceptable, but the process must be competent and transparent. The HSCIC has now told the BBC that this release of your medical records broke the rules, and that there may have been other similarly erroneous releases; but it won’t say more until ‘later this year’.
On Tuesday, at a Health Select Committee hearing, things got worse. The HSCIC said it couldn’t share documentation on these releases because it had all been done by its predecessor body, the NHS Information Centre – even though the HSCIC replaced the NHSIC in 2013, and is in the same building, doing the same job, with almost identical personnel and all the old records. Furthermore, the actuaries’ report using the hospital data carries the HSCIC’s logo – not the old NHSIC one – added, the HSCIC admits, with its full consent. If the HSCIC disapproves of the NHSIC releasing this data – or regards it as illegal – why did it add its logo and approval to the output?
Also, is it really true that release to the insurance industry is unacceptable to the HSCIC? Its own information governance assessment from August says that access to individual patients’ records can ‘enable insurance companies to accurately calculate actuarial risk so as to offer fair premiums to its [sic] customers. Such outcomes are an important aim of Open Data, an important government policy initiative.’ Is that document binding? What are the rules? Are there previous dodgy data-sharing arrangements, agreed by the NHSIC, that the HSCIC is still honouring, with data still flowing out of the building?
This is chaos. Then, on Thursday, to make things worse, Public Health Minister Jane Ellison appears to have misled Parliament, telling it that the data released by the HSCIC was ‘publicly available, non-identifiable and in aggregate form’. This is utterly untrue. It was line-by-line data – every individual hospital episode, for every individual patient, with unique pseudonymous identifiers – which was then aggregated into summary tables by the actuaries.
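The distinction matters, so here is a toy illustration in Python, with invented rows: ‘line-by-line’ means one row per hospital episode per patient, carrying a pseudonymous identifier; ‘aggregate’ means only the summary counts built from those rows.

```python
from collections import Counter

# A toy illustration of line-by-line versus aggregate data.
# The rows below are invented.

line_by_line = [
    {"pseudonym": "p01", "age_band": "60-69", "died": False},
    {"pseudonym": "p01", "age_band": "60-69", "died": False},  # same patient, second episode
    {"pseudonym": "p02", "age_band": "60-69", "died": True},
    {"pseudonym": "p03", "age_band": "70-79", "died": True},
]

# What the actuaries produced: counts per age band, no individuals visible.
aggregate = Counter((r["age_band"], r["died"]) for r in line_by_line)
print(dict(aggregate))
# {('60-69', False): 2, ('60-69', True): 1, ('70-79', True): 1}

# What was actually released was the list above, not the summary counts:
# every episode, every patient, traceable through its pseudonym.
```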
To summarise, a government body handed over parts of my medical records to people I’ve never met, outside the NHS and the medical research community, but it is refusing to tell me what it handed over, or who it gave it to, and the Minister is now incorrectly claiming that it never happened anyway.
There are people in my profession who think they can ignore this problem. Some are murmuring that this mess is like MMR, a public misunderstanding to be corrected with better PR. They are wrong: it’s like nuclear power. Medical data, rarefied and condensed, presents huge power to do good, but it also presents huge risks. When leaked, it cannot be unleaked; when lost, public trust will take decades to regain.
This breaks my heart. I love big medical datasets, I work on them in my day job, and I can think of a hundred life-saving uses for better ones. But patients’ medical records contain secrets, and we owe them our highest protection. Where we use them – and we have used them, as researchers, for decades without a leak – this must be done safely, accountably and transparently. New primary legislation, governing who has access to what, must be written: but that’s not enough. We also need vicious penalties for anyone leaking medical records; and the HSCIC needs to regain trust, by releasing all documentation on all past releases, urgently. Care.data needs to work: in medicine, data saves lives.
 
The care.data programme was suspended shortly after this piece was published, with the promise that they’d have a think and relaunch in six months. Six months have already passed, and there has been no relaunch. I’m on their Advisory Group and continue to shout about the issues raised above, indoors and out. Medical data can save lives, but if the single biggest project ever conceived on patient records is not handled properly, we risk destroying public trust for all such projects, not just care.data.
SURVEYS
The Huff
Guardian, 19 January 2008
In 1954 a man called Darrell Huff published a book called How to Lie with Statistics. Chapter 1 is called ‘The Sample with the Built-in Bias’, and it reads exactly like this column, which I’m about to write, on a Daily Telegraph story in 2008.
Huff sets up his headline: ‘The Average Yaleman, Class of 1924, Makes $25,111 a Year!’ said Time magazine, half a century ago. That figure sounded pretty high: Huff chases it, and points out the flaws. How did they find all these people they asked? Who did they miss? Losers tend to drop off the alma mater radar, whereas successful people are in Who’s Who and the College Record. Did this introduce ‘selection bias’ into the sample? And how did they pose the question? Can that really be salary rather than investment income? Can you trust people when they self-declare their income? Is the figure spuriously precise? And so on.
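Huff’s built-in bias is easy to reproduce. Here is a minimal simulation in Python, with invented numbers, under one stated assumption: that richer alumni are proportionally easier to find and more willing to answer.

```python
import random

# A minimal simulation of Huff's 'sample with built-in bias'.
# All figures are invented; the only assumption is that the
# probability of being surveyed rises with income.

random.seed(1954)
true_incomes = [random.lognormvariate(8.5, 0.8) for _ in range(1000)]

# Losers 'drop off the alma mater radar': selection probability
# is proportional to income.
max_income = max(true_incomes)
surveyed = [x for x in true_incomes if random.random() < x / max_income]

true_mean = sum(true_incomes) / len(true_incomes)
survey_mean = sum(surveyed) / len(surveyed)
print(f"True mean income:     ${true_mean:,.0f}")
print(f"Surveyed mean income: ${survey_mean:,.0f}")   # reliably higher
```

Run it a few times with different seeds: the surveyed mean comes out above the true mean every time, because the sample selects itself.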