The Tiger That Isn't


by Andrew Dilnot


  Social and political life is as rich and subtle as our own, and every bit as resistant to caricature by a single objective defined with a single measurement. If you want to summarise like this, you have to accept the violence it can do to complexity.

  This is why targets struggle. They necessarily try to glimpse a protean whole through the keyhole of a single number. The strategy with targets is like that for the average: think about what they do not measure, as well as what they do; think about what else lies beyond the keyhole.

  A consultant was asked to come to casualty urgently. She hurried down to find a patient about to breach the four-hour target for trolley waits. She also found another patient, who seemed in more pressing need, and challenged the order of priority. But no, they wanted her to leave that one: 'He's already breached.'

  When we asked people to email More or Less with examples of their personal experience of 'gaming' in the NHS – or what is sometimes called 'hitting the target but missing the point' – we had little idea of the extent, or variety.

  Here is another: 'I work in a specialist unit. We're always getting people sent to us by Accident and Emergency (A & E) because once they are referred the patient is dealt with as far as they are concerned and has met their target. Often, the patients could have been seen and treated perfectly well in A & E. But if they were sent to us, they might be further away from home and have to wait several days to be treated because they are not, and frankly never were, a priority for a specialist unit.'

  And another: 'I used to work for a health authority where part of my job was to find reasons to reset the clock to zero on people's waiting times. We were allowed to do this if someone turned down the offer of an appointment, for example. But with the waiting-time targets, we began working much harder to find opportunities to do it, so that our waiting times would look shorter.'

  Why do targets go wrong? The first reason has to do with their attempt to find one measurable thing to represent the whole. An analogous problem appears in the Indian fable of the Blind Men and the Elephant. The best-known Western version is by an American poet, John Godfrey Saxe (1816–87):

  It was six men of Indostan

  To learning much inclined,

  Who went to see the Elephant

  (Though all of them were blind),

  That each by observation

  Might satisfy his mind

  But their conclusions depended entirely on which bit of the elephant they touched, so they decided, separately, that the elephant was like a wall (its side), a snake (trunk), a spear (tusk), a tree (leg), a fan (ear), or a rope (tail).

  And so these men of Indostan

  Disputed loud and long,

  Each in his own opinion

  Exceeding stiff and strong,

  Though each was partly in the right,

  And all were in the wrong!

  That whole elephant is a pig to summarise. A single measure of a single facet leaves almost everything useful unsaid, and our six men still in the dark. In life, there are many elephants.

  But the problem is worse than that. In health and education (two of the biggest), it is not only that one part doesn't adequately represent the whole. There is also the tendency for the parts we do not measure to do odd things when our backs are turned: while measuring the legs, the trunk starts to misbehave. And so there have been occasions in healthcare when the chosen measure was the number of patients surviving an operation, with the result that some surgeons avoided hard cases (who wants a dead patient spoiling the numbers?). At least if they died without reaching the operating table, they didn't risk dying on it. So part of the elephant, unseen or at least unmeasured, was left to rot, even as the measured part told us the elephant was healthy.

  To be partly in the right is often the best that targets can do, by showing only part of the picture. The ideal, obviously, is to show us the whole elephant, but it is no slight on numbers to say that they rarely can. Hospital waiting lists or ambulance response times are worth measuring but, even if the numbers are truthful – and we will see shortly why they are often not – they are inevitably grossly selective, and say nothing at all about the thing we care for most, which is the quality of care. Yes, patients were seen quickly, but did they die at the hands of inadequately trained personnel, whose job it was to arrive at the scene quickly in order to meet the target but who lacked the skills of a capable paramedic? Ambulance trusts really have done such things.

  Targets, and their near allies performance indicators, face just such a dilemma. One measurement is expected to stand as the acid test, one number to account for a wide diversity of objectives and standards, while the rest … out of sight, undefined, away from scrutiny, unseen through the keyhole, who cares about the rest?

  In a famous cartoon mocking old-style Soviet central planning and target setting, the caption told of celebrations after a year of record nail production by the worker heroes of the Soviet economy, and the picture showed the entire year's glorious output: a single gigantic nail; big, but big enough for what? Measurements can be a con unless they are squared up to their purpose, but it is a con repeatedly fallen for.

  Monomania, including mono-number mania, is potentially dodgy in any guise. The best strategy with targets, and indeed with any single-number summary, is to be clear not only what they measure, but what they do not, and to perceive how narrow the definition is. So when a good health service is said to mean short waiting times, and so waiting times are what is measured, someone might stop to ask: 'and is the treatment at the end any good?'

  But quality of healthcare tends not to be measured because no one has worked out how to do it with any subtlety – yet. So we are left with a proxy, the measurement that can be done rather than the one that ideally should be done, even though it might not tell us what we really want to know, a poor shadow that might look good even though quality is bad, or vice versa – right in part, and also wrong.

  In October 2006 on More or Less we investigated behaviour around the four-hour target for trolley waits in casualty, and found evidence to suggest that hospitals were formally admitting people (often involving little more than a move down the corridor to a bed with a curtain), in some cases solely in order to stop them breaching the limit. That is, they were not really being treated within the waiting target; it just looked that way. Sometimes this practice was probably clinically justified, and it was genuinely preferable for patients to be somewhere comfortable while investigations were completed. In other cases the patients were out again fifteen minutes after being admitted, feeding the suspicion that admittance had been a gaming strategy, not a clinical need. A massive increase in what are called zero-night-stay admissions added to the suspicion that this practice was so widespread as to be partly the result of playing the system to meet the target. While the number of people arriving at A & E between 1999/2000 and 2004/5 went up 20 per cent, the number admitted went up 40 per cent.

  None of this is illegal; to some extent it is perfectly rational for those taking the decisions, under pressure in a busy A & E department, to respond to the incentives they are given. If those doctors and nurses simply cannot see everyone who has come to A & E in the four-hour limit and will be penalised if they miss that target, admitting them to a ward seems the next best thing. That way, they will be seen (eventually), and the A & E department hits its four-hour target to see or admit. But the problem has worsened since the government introduced a system of payment by results. Now, every time a hospital admits someone from A & E, the hospital is paid £500. So the gain from admitting someone from A & E is no longer just that it helps meet the four-hour target, but also that it raises funds for the hospital. More funds for the hospital means less somewhere else in the NHS, and if the admissions are in truth unjustified, this implies a serious misallocation of resources.

  Since our report, the Information Centre for Health and Social Care, NHS data-cruncher in chief, has said at a public conference that we caused it to launch its own investigation, despite Department of Health denials that there was any kind of problem to be investigated. A department official told us that the changes revealed by these statistics represented good clinical practice. If so, it is a practice the government does not want too much of, since the rise from year to year in 0-day admissions that it is prepared to pay for has now been capped – a strange way of showing approval. We have learned that one hospital has agreed to cut the fee it charges to primary care trusts for such admissions, and another is renegotiating.

  According to Gwyn Bevan, Professor of Management Science at the London School of Economics, and Christopher Hood, Professor of Government and Fellow of All Souls College Oxford, the current faith in targets rests on two 'heroic' assumptions.

  The first is the elephant problem: the parts chosen must robustly represent the whole, a characteristic they call 'synecdoche', after the figure of speech in which a part stands for the whole. For example, we speak of hired hands when we mean, of course, the whole worker. The second heroic assumption is that the design of the target will be 'game proof'.

  The difficulties with the second follow hard on the first. Because it is tough to find the single number to stand for what we want out of a complex system delivering multiple objectives (just as it is in life), because one part seldom adequately speaks for all, it gives licence to all kinds of shenanigans in the parts that are untargeted, out of sight and often out of mind. So if your life did happen to be judged solely on your income, with no questions asked about how you came by it, and if there was no moral imperative to hold you back, 'gaming' would be too gentle a word for what you might get up to at the bank with a stocking mask and shotgun.

  Bevan and Hood have catalogued a number of examples of targets in the health service that aimed at something which seemed good in itself, but which pulled the rug on something else. The target was hit, the point was missed, damage was done elsewhere.

  In 2003, the Public Administration Select Committee found that the waiting-time target for new ophthalmology out-patient appointments had been achieved at one hospital by cancelling or delaying follow-up appointments. As a result, over two years at least twenty-five patients were judged to have gone blind.

  In 2001 the government said all ambulances should reach a life-threatening emergency (category A) within eight minutes, and there was a sudden massive improvement in the number that did, or seemed to. But what is a 'life-threatening emergency'? The proportion of emergency calls logged as category A varied fivefold across trusts, ranging from fewer than 10 per cent to more than 50 per cent. It then turned out that some ambulance services were doctoring their response times – lying, to put it plainly. They cheated on the numbers, but it was also numbers that found them out, when it was discovered that there was a suspiciously dense concentration of responses recorded just inside eight minutes, causing a sudden spike in the graph, and almost no responses just above eight minutes. This was quite unlike the more rounded curve of response times that one would expect to see, and not a pattern that seemed credible. The chart shows the pattern that led to suspicion that something was amiss. There was even some evidence that more urgent cases were sometimes made to wait behind less urgent (but still category A) cases, to meet the targets.

  Figure 8 Ambulance response times
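  The statistical give-away described above – a dense spike just inside eight minutes and a near-empty gap just above – can be sketched with a short simulation. Everything here is invented for illustration (the distribution of true response times, the `record` function standing in for the doctoring); it is a toy model of the pattern in Figure 8, not the trusts' actual data.

```python
import random

random.seed(42)

# Invented, illustrative distribution of true category-A response
# times in minutes, roughly centred near the 8-minute target.
true_times = [random.lognormvariate(2.0, 0.35) for _ in range(10_000)]

# Hypothetical "doctoring": any response a little over the 8-minute
# target gets logged as just inside it.
def record(t, target=8.0, window=1.5):
    if target <= t < target + window:
        return target - random.uniform(0.01, 0.2)
    return t

recorded = [record(t) for t in true_times]

def bin_count(times, lo, hi):
    """Count responses recorded in the half-open interval [lo, hi)."""
    return sum(lo <= t < hi for t in times)

# The tell-tale signature: a spike just under eight minutes and a
# gap just over -- not the smooth, rounded curve genuine times show.
just_under = bin_count(recorded, 7.5, 8.0)
just_over = bin_count(recorded, 8.0, 8.5)
print(f"recorded in [7.5, 8.0): {just_under}")
print(f"recorded in [8.0, 8.5): {just_over}")
```

  Plotted as a histogram, the recorded times would show exactly the implausible discontinuity at the target that roused suspicion: honest data bunch smoothly around their average, while gamed data pile up just inside the threshold.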

  The new contract for GPs, which began in 2004, rewards them for, among other things, ensuring that they give patients at least ten minutes of their time (a good thing, if the patients need it). But that also gives an incentive to spin out some consultations that could safely be kept short, 'asking after Aunty Beryl and her cat', as one newspaper put it. The creditable aim was to ensure people received proper attention. The only practical way of measuring this was with time, and so a subtle judgement was summarised with a single figure, that became a target, that created incentives, that led to suspicions of daft behaviour.

  On waiting times in A & E, where there has been a sharp improvement in response to a target, but ample evidence of misreporting, Bevan and Hood concluded: 'We do not know the extent to which these were genuine or offset by gaming that resulted in reductions in performance that was not captured by targets.' Gwyn Bevan, who is, incidentally, a supporter of targets in principle and has even helped devise some, told us that when a manager worked honestly but failed to reach a target, then saw another gaming the system, hitting the target and being rewarded, the strong incentive next time would be to game like the rest of them: bad behaviour would drive out good.

  Some people respond to a system of performance measurement by genuinely improving their performance. But others react by diverting energy into arguing with it; some ignore it; some practise gaming, finding ways, both ingenious and crude, to appear to be doing what's expected but only by sleight of hand; and some respond with downright lies about their results.

  Bevan and Hood identify four types. There are the 'saints', who may or may not share the organisation's goals, but their service ethos is so high that they voluntarily confess their shortcomings. There are the 'honest triers' who won't incriminate themselves, but do at least get on with the job without resorting to subterfuge. The third group they call the 'reactive gamers', who might even broadly agree with the goals but, if they see a low-down shortcut, take it. Finally, there are what Bevan and Hood call 'rational maniacs', who would spit on the whole system, but do what they can to conceal their behaviour, shamelessly manipulating the data.

  Given these and other complications, good numbers could be telling us any of four things. 1: All is well, performance is improving and the numbers capture what's going on. 2: The numbers capture what's going on in the parts we're measuring but not what's happening elsewhere. 3: Performance as measured seems fine, but it's not what it seems because of gaming. 4: The numbers are lies.

  But how do we tell which one applies in any case? Here is the nub of the problem: too often we can't be sure.

  In America too, over the years, there has been a long list of attempts to measure (and therefore improve) performance, often with financial incentives, that have somehow also returned a kick in the teeth. The following are just a few examples from studies published in various medical journals.

  In New York State it was found that reporting of cardiac performance had led to reluctance to treat sicker patients and 'upcoding' of comorbidities (exaggerating the seriousness of the patient's condition). This made it look as if the surgery was harder, so that more patients might be expected to die, meaning that performance looked impressive when they didn't.

  More than 90 per cent of America's health plans measure performance using a system called HEDIS (Healthcare Effectiveness Data and Information Set). This consists of seventy-one measures of care. In 2002 it was found that, rather than improve, some poor performers had simply ceased publishing their bad results.

  In the 1990s, a prospective payment system was introduced for Medicare. This paid a standard rate for well-defined medical conditions in what were known as diagnosis-related groups (DRGs). The system effectively set a single target price for each treatment for all healthcare providers, thus encouraging them to bring down their costs, or so it was thought.

  The effect was satirised in an article in the British Medical Journal in 2003 that feigned to offer advice. 'The prospective payment system in the United States … has created a golden opportunity to maximise profits without extra work. When classifying your patient's illness, always 'upcode' into the highest treatment category possible. For example, never dismiss a greenstick fracture as a simple fracture – inspect the x-ray for tiny shards of bone. That way you can upgrade your patient's break from a simple to a compound fracture and claim more money from the insurance company. 'DRG creep' is a well-recognised means of boosting hospital income by obtaining more reimbursement than would otherwise be due.'

  The article added that a national survey of US doctors showed 39 per cent admitted to using tactics – including exaggerating symptoms, changing billing diagnoses, or reporting signs or symptoms that patients did not have – to secure additional services felt to be clinically necessary. The scope for such behaviour has been reduced in the years since, but not eliminated.

  The key point from this mountain of evidence is that when we use numbers to try to summarise performance, all this, and more, will be going on in the background, changing even as we attempt to measure it, changing because we attempt to measure it. Can numbers reliably capture the outcome of this complexity? In essence, we witness in targets an unending struggle between the simplicity of the summary and the complexity (and duplicity) of human behaviour.

  In health data there's yet another twist. A paper published in 2007 by Rodney A. Hayward, from the University of Michigan, points out that performance measures for healthcare are often agreed after high-stakes political arguments in which anyone with an interest in an illness advocates idealised standards of treatment. They want, naturally, more and more resources for their cause and press for care that may be only marginally beneficial. They fight for standards on behalf of the most needy cases.

  But this entirely understandable advocacy for idealised standards of treatment for, say, diabetes takes no account of the demands for treatment of, say, Alzheimer's disease. Performance measurement against a set of such standards can make it appear that everyone is failing.

  Hayward comments: 'It sounds terrible when we hear that 50 per cent of recommended care is not received, but much of the care recommended by subspecialty groups is of modest or unproven value, and mandating adherence to these recommendations is not necessarily in the best interests of patients or society … Simplistic all-or-nothing performance measures can mislead providers into prioritising low-value care …'

 
