by Marty Makary
Unfortunately, some physicians believe that a lack of a randomized controlled trial means there’s no evidence. That sloppy and dangerous thinking gets worse when the medical community conflates “no evidence” with “not true.” That’s a logical fallacy. The term “no evidence to support” actually means one of two things: it’s been studied and evidence does not support it, or it has not been studied and could be true. The liberal use of “no evidence to support” has conditioned us to distrust anything not supported by trial. I’ve taught my students and residents to do better, replacing the sloppy phrase “There is no evidence” with either “It is unknown because it has not been adequately studied” or “It has been studied adequately and has not been shown to be effective.”
The concept behind Improving Wisely is to apply the wisdom of expert doctors to identify practice patterns that appear inappropriate. When I show the practice pattern data to doctors and they see outliers, they say “I get it.” No trial necessary.
Engaging Doctors
Over the next year, as I had cafeteria and hallway conversations with my Johns Hopkins colleagues, I’d inquire about overuse in their particular specialty. Then I would ask whether the overuse could be measured as a pattern. Most of the time, the doctors had an immediate response. They often started with practices targeted by the Choosing Wisely project. But most of those can’t be measured in big data because most national data lack the needed granularity. After more conversation, I could usually pique their interest and develop some sort of measure with them. For a time, I asked these same questions of almost every doctor I ran into, including at hospitals where I spoke as a visiting professor and at national medical conferences. My list of overused medical practices started growing. Insider insights shared with me were both fascinating and alarming.
My colleagues who do breast surgery, for instance, told me that some surgeons have very high rates of calling patients back after a lumpectomy for a reexcision. I had Peiqi, my analyst, pull the national data, and what we found was remarkable. While most surgeons have a reexcision rate below 20% (a number that my colleagues thought was a reasonable boundary of what should be considered acceptable), nearly one in seven breast surgeons have a reexcision rate over 30%. Here’s the data, a distribution of all U.S. surgeons who perform more than ten lumpectomy procedures per year on Medicare patients (see following graph). What was surreal was to see the actual names of the outlier doctors (see following chart). Tragically, they were getting paid a lot more for being outliers.
We presented this work at a leading surgical conference, the Southern Surgical Association,2 where it generated tremendous interest, along with frustration that these wide variations in quality persist.
GI
I often work with gastroenterology (GI) doctors, so naturally I began to ask them about areas of overtreatment and waste in their specialty. They unloaded a treasure trove of areas to measure patterns of overuse. They told me of the hemorrhoid banding procedure, in which a doctor wraps a rubber band around the base of a hemorrhoid to cut off its blood supply. The bands shouldn’t be applied in more than about 10% of cases. But some doctors band every hemorrhoid they can get their hands on. When I asked why they do it so often, the GI docs responded with a comment I started hearing a lot: “It pays well.” Within days, I showed the GI doctors the data supporting their suspicion: a fraction of doctors performed hemorrhoid banding on nearly every patient they evaluated. And yes, it hurts to write about it.
Another GI colleague, Dr. Eun Ji Shin, told me some doctors maximize their billing by spreading out over two separate days two procedures that should be done at the same time. It’s common for non-urgent patients with stomach complaints or severe heartburn to need both an upper endoscopy and a lower colonoscopy. Whenever possible, their doctor should do both procedures in sequence while they are sedated. But Dr. Shin said there were doctors who game the system by scheduling the two procedures on different days. When the doctor owns or co-owns the procedure center, they can make much more money that way. Of course, sometimes doing it in two procedures is the right thing to do, but Dr. Shin explained how a pattern would uncover those playing the game.
Taking this information, I went back to the database. As Dr. Shin predicted, I discovered that most doctors combine the procedures, as they should. The average doctor does the procedures over two days only about 18% of the time.3 But that’s misleading. At bigger institutions, the average is 13%. At smaller, privately owned endoscopy centers, the average is 24%. Then we found the outliers. A small group of GI doctors performed the procedures on separate days every time! And a bunch more did it on two different days more than half the time—a threshold GI experts called indefensible. This not only creates a lot of hassle and expense for patients, it’s risky, because the patient must go under anesthesia a second time.
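For readers who work with data, the separate-day measure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual analysis: the record fields (`id`, `setting`, `separate`, `total`) are hypothetical names, each group average is assumed to be a simple mean of per-doctor rates, and the 50% default is the threshold the GI experts called indefensible.

```python
from statistics import mean


def separate_day_summary(doctors, threshold=0.5):
    """Summarize how often each doctor splits the two procedures across days.

    doctors: list of dicts with hypothetical keys
        'id'       - doctor identifier
        'setting'  - practice setting label (the labels are assumptions)
        'separate' - patients who had the two procedures on different days
        'total'    - patients who had both procedures (assumed to be > 0)
    """
    # Per-doctor rate: share of paired procedures done on separate days.
    rates = {d["id"]: d["separate"] / d["total"] for d in doctors}

    # Group the per-doctor rates by practice setting, then average each group.
    by_setting = {}
    for d in doctors:
        by_setting.setdefault(d["setting"], []).append(rates[d["id"]])
    setting_means = {s: mean(r) for s, r in by_setting.items()}

    # Flag doctors whose rate exceeds the expert-defined threshold.
    outliers = sorted(i for i, r in rates.items() if r > threshold)
    return rates, setting_means, outliers
```

Reporting the averages by setting alongside the outlier list reflects the point above: a single overall average can hide both the gap between settings and the small group of doctors who separate the procedures most or all of the time.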
As my GI colleagues and I began to draft a research article describing our findings, we invited two GI doctors who were new to our faculty to review the study. I noticed them having a side conversation as if something was wrong. I stopped and asked them if they had additional insights. It turns out they were relieved to learn that their department chair, a coauthor of the study, did not expect them to maximize billing by doing the procedures on different days.
“We don’t do that at this hospital,” one of their new GI colleagues told them. “We do what’s best for the patient.”
“Whew,” one of the new doctors said. “Where we came from, we were expected to separate all upper and lower endoscopy procedures into different days. We assumed the same was true here.”
We laughed, but this was gallows humor, based on a shared recognition that our health care system was corrupt and we were all part of it.
I was shocked to hear how common it was to break up the procedures and how accepted it had become. Thankfully, that day both doctors were liberated from the different-day endoscopy game. If that was something that needed clarifying at Johns Hopkins, it could be happening anywhere.
Cardiac
Other specialists also gave me additional leads to expand the Improving Wisely project. My heart surgeon colleagues down the hallway told me about operations on the mitral valve, which allows blood to flow from one chamber of the heart, the left atrium, into another chamber, the left ventricle. When the mitral valve malfunctions, they can either replace it or repair it with scissors and stitches. Repairing the mitral valve is a much better option for patients when possible. Among other benefits, patients are spared the need to take expensive and risky blood thinners for the rest of their lives. But deciding to do a repair requires a heat-of-the-moment decision, since inspecting the valve during surgery is part of the process. The cardiac surgeons told me it’s possible to repair the valve in up to half the cases. But some of their colleagues take the one-hammer approach and replace them all.
Children
In pediatric surgery, the surgeons told me about an outdated practice of routinely operating on one- and two-year-old kids who happen to have a small belly button hernia, what lay people call an “outie” and what we call an umbilical hernia. Best practices in the specialty have matured. The vast majority of these hernias close on their own as the child grows. It’s recommended that surgeons wait until a kid turns six or seven. On top of that, new research has found that general anesthesia in young kids can be associated with learning disabilities. Bottom line: there are only rare cases when a surgeon needs to close an umbilical type hernia in a child under age four. (Inguinal hernias are a different matter.)
After a robust discussion with Dr. Mehul Raval and other pediatric surgeons, we created a measure to capture the inappropriate pattern of operating on kids too early. The metric was simple. We decided to look at the proportion of all elective umbilical hernia operations a surgeon performs on children under age four. It should be rare, less than 10%. But the data showed that for about one in five surgeons, doing the operation on kids under age four was the rule, not the exception. Without looking at patterns, a reviewer would not be able to discern from the patient’s record whether or not it was unnecessary. Each case would have documentation of soft criteria for surgery, such as abdominal pain.
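The arithmetic behind this measure is simple enough to sketch in Python. The version below is a minimal illustration, not the project's actual code: the input format (pairs of surgeon identifier and patient age), the function names, and the `min_cases` cutoff are all hypothetical, and treating a share above 10% as an outlier is one assumed reading of "less than 10%."

```python
from collections import defaultdict


def under_four_share(operations, min_cases=1):
    """Share of each surgeon's elective umbilical hernia repairs done on
    children under age four.

    operations: iterable of (surgeon_id, patient_age_in_years) pairs.
    min_cases:  hypothetical minimum volume a surgeon needs to be scored.
    """
    totals = defaultdict(int)
    young = defaultdict(int)
    for surgeon_id, age in operations:
        totals[surgeon_id] += 1
        if age < 4:
            young[surgeon_id] += 1
    # Only score surgeons with enough cases for the share to be meaningful.
    return {s: young[s] / n for s, n in totals.items() if n >= min_cases}


def flag_outliers(shares, threshold=0.10):
    """Return surgeons whose under-four share is above the threshold."""
    return sorted(s for s, share in shares.items() if share > threshold)
```

The measure looks only at the pattern across a surgeon's cases, which is why it can surface what a chart-by-chart review cannot: no single record has to be judged unnecessary for the overall share to stand out.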
End of Life
In cancer care, the oncologists recommended measuring the proportion of cancer deaths in an oncologist’s practice when the patient was receiving chemotherapy or radiation in the two weeks prior to their death. If an oncologist had 10 or 20% of their patients die while on chemo or radiation, it could mean the patients’ deaths weren’t anticipated. But if 80 to 100% of an oncologist’s patients received chemo or radiation within two weeks of dying, the doctor may not be exercising good judgment about when to back off the aggressive treatment in a case that’s past hope.
Dental Care
In dentistry, silver diamine fluoride drops were found in a large study to stop cavities. The drops can be reapplied as needed and can replace the need to drill a tooth and put in a filling. The only side effect is that it can darken the tooth, even turn it black if reapplied multiple times. That’s not good for baby Zoolanders, but it might spare a child the trauma of drilling. I offered both options to my eight-year-old nephew when his dentist recommended drilling, and guess what? He chose the drops.
I asked dentists about the ethics of putting kids through drilling when the drops were effective. Most of them downplayed the silver drop therapy. Other dentists said it’s highly effective, well proven, and ideal when the teeth are going to be falling out anyway. They also said it’s vastly underused because it’s a threat to the lucrative business of drilling. My team worked with a group of dentists to construct the measure: What proportion of children does a dentist treat with silver diamine fluoride versus drill? If a dentist drills every cavity and never applies silver diamine fluoride, that dentist is probably not presenting the drops option to patients. Dental procedures for cavities are a common Medicaid expenditure. Silver diamine fluoride costs $109, about a quarter to half the price of a filling, depending on who’s doing it.
What It All Means
These appropriateness measures have implications for the overall cost of health care. Operations cost thousands of dollars, endoscopy procedures are common, and chemotherapy costs an arm and a leg. Medicare pays for a lot of it.
The more doctors I sat down with, the more I discovered that overtreatment penetrates most corners of medicine. Many procedure suggestions I got from medical professionals were not measurable. But the ones we could measure were telling. Many of the areas of overuse are tests and procedures that generate money. Our growing list of practice pattern measures largely focused on expensive items. Given the broad interest in lowering health care costs, the Improving Wisely metrics got popular fast. Health care organizations started calling me asking if I could run the algorithms on their data. I learned secondhand that some people called the algorithms “waste metrics.” I prefer “appropriateness metrics,” since that was the spirit of what we were trying to measure. After thousands of sit-down conversations and follow-up chats with specialists and subspecialists, I laid out and validated more than a hundred metrics (of which I have been able to publish only a few in the medical literature because of the slow pace of medical journals in reviewing and publishing articles).
Within a year, the national demand for the measures outpaced my ability to develop them. In one case, a health care organization asked my team to run the appropriateness metrics on an orthopedic group it wanted to acquire. Before sticking their brand on the practice, they wanted to know whether the group did a lot of unnecessary surgery.
A Wide Impact
I decided it was time to recruit help. My surgical practice had a great staff, but we were fielding more calls than we could handle. I remembered the two previous times in my career when the phones were ringing off the hook like this. The first was when Bryan Sexton, Peter Pronovost, and I published a hospital employee safety culture survey and demonstrated how it can be used by hospital leaders.4 At that time, Sexton handed the survey to an entrepreneur on a silver platter to free himself of the logistics. Pascal Metrics, a D.C.-based health care software company, took over the process and it continued to grow. I stayed financially independent so I could be free to advocate for the survey.
The other time our phones rang off the hook was when I published the first article describing a surgery checklist in the medical literature.5,6,7,8 Similarly, I was able to hand it off to the World Health Organization, which used it as the basis for its surgery checklist.9 The WHO had a good platform and asked me to help adapt my checklist into a formal version with the WHO stamp on it. Again, I remained financially independent in order to promote the checklist free of bias.
Looking for help in meeting the nationwide demand for the metrics, I called Jim Fields, a friend at the Chicago-based consulting firm Oliver Wyman. Jim had heard me speak on my early work on measuring appropriateness and saw its potential. Jim was the right guy to partner with because he was the real deal. He had two young girls with profound disabilities, one the result in part of medical care that was inappropriate. For him, reducing unnecessary medical care was personal.
Over a dinner at a Chicago steakhouse, I told Jim that I was blowing off offers from companies that wanted to monetize our efforts because I wanted to ensure that these measures remained doctor-developed, doctor-endorsed, and doctor-friendly. I knew that it worked to collaborate with experts to create and display practice patterns and then share the data with outliers. I also knew that cutting corners on the tedious process of gaining expert physician consensus could result in doctors being unfairly measured. As a pancreas surgeon who had been unfairly branded for my high readmission rate, even though it was intrinsic to pancreas surgery, I did not want to see that happen to anyone else.
The goal was to embrace practice variation, I told Jim. Medicine is an art and different doctors take care of diverse populations. But when there is consensus, practice variation should be within physician-endorsed boundaries. “The goal here is to let outliers know that they are outliers and help guide them toward best practices,” I said. “The goal is improvement.” Jim clearly agreed.
Together, Jim and I used the same proven model to expand the project. We recruited hundreds of doctors to help us craft the practice pattern measures. Ultimately, Jim and his team began working directly with health care organizations to run the appropriateness measures in the real world. Along with Lucy Liu, Frank Roberts, and others, they formed a data analytics team to show health care organizations how their doctors were doing. They called the service Practicing Wisely.
I attended some of the meetings Jim and his team had with hospital leaders to review their data. At one medical center, we were looking at how often doctors did biopsies during a screening colonoscopy. Colon polyps occur only about 24% of the time in the general population.10 Our experts determined that doctors who remove polyps in more than 50% of screening colonoscopies have a pattern that signals possible overuse. One of the doctors at this medical center did biopsies in more than 90% of the procedures. “I’m going to have his department chair talk with him,” said the hospital’s chief medical officer.
Over the next two years, my team and I marched out the same model through many areas of medicine, focusing on unnecessary medical care as determined by expert consensus. Between my research group and the larger effort to scale our model, we created more than 500 measures of clinical appropriateness. We identified hundreds of millions of dollars in potential savings if the overuse could be eliminated, not to mention all the harm to patients that could be avoided. For each of these measures, we defined the metric carefully using input from experts, brought the results (i.e., the distribution of doctors) back to the experts in that field, and then asked them to define the range within which doctors should fall. In other words, for each of the 500 measures, we relied on the experts to define a threshold that would clearly identify outliers in that clinical situation. The goal was not to punish outliers, but to let them know where they stand and offer help. In some instances, institutions would undertake a more focused review of outliers. In other cases, the awareness that patterns were being examined created a culture of accountability.
The Practicing Wisely project now runs metrics in big data for health care organizations, allowing leaders to see where their doctors stand in relation to local, state, and national benchmarks. The metrics will need to be revisited each year based on the latest scientific research, society guidelines, and an evolving consensus among practicing specialists. For example, as I write this book, a new study shows that women with a common type of Stage 1 breast cancer may not need chemotherapy if a certain genetic test yields a positive result.11 Accordingly, in early 2019 we finalized a new metric: the proportion of early breast cancer patients a physician treats who have the genetic test done. If an oncologist rarely or never orders the test, it can be deduced that some of their patients are inappropriately receiving chemo.
The metrics have had a big impact in the organizations that use them. But like most innovations in health care, they are not a comprehensive way to measure quality. They are only a starting point to flag overuse in certain common clinical situations. There are many areas of medicine, psychiatry for example, that big data is too clumsy to measure precisely. Some data can capture overuse of medications and certain medication interactions, but it’s a challenging area to measure.