
Super Thinking


by Gabriel Weinberg


  Replication efforts are an attempt to distinguish between false positive and true positive results. Consider the chances of replication in each of these two groups. A false positive is expected to replicate—that is, a second false positive is expected to occur in a repetition of the study—only 5 percent of the time. On the other hand, a true positive is expected to replicate 80 to 90 percent of the time, depending on the power of the replication study. For the sake of argument, let’s assume this is 80 percent as we did in the last section.

  Using those numbers, a replication rate of 50 percent requires about 60 percent of the studies to have been true positives and 40 percent of them to have been false positives. To see this, consider 100 studies: If 60 were true positives, we would expect 48 of those to replicate (80 percent of 60). Of the remaining 40 false positives, 2 would replicate (5 percent of 40) for a total of 50. The replication rate would then be 50 per 100 studies, or 50 percent.
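  The arithmetic above can be checked with a short Python sketch. The 80 percent and 5 percent figures come from the text; the function names are only illustrative.

```python
# Replication probabilities taken from the text.
TRUE_POSITIVE_REPLICATION = 0.80   # power of the replication study
FALSE_POSITIVE_REPLICATION = 0.05  # chance a false positive shows up again


def replication_rate(true_positive_share):
    """Expected share of studies that replicate, given the share that were true positives."""
    false_positive_share = 1 - true_positive_share
    return (true_positive_share * TRUE_POSITIVE_REPLICATION
            + false_positive_share * FALSE_POSITIVE_REPLICATION)


def true_positive_share_for(rate):
    """Invert replication_rate: the true positive share implied by an observed rate."""
    return (rate - FALSE_POSITIVE_REPLICATION) / (
        TRUE_POSITIVE_REPLICATION - FALSE_POSITIVE_REPLICATION)


print(round(replication_rate(0.60), 4))         # 0.5
print(round(true_positive_share_for(0.50), 4))  # 0.6
```

  Running the inverse function with other observed replication rates shows how sensitive the implied share of true positives is to that rate.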

  [Figure: Replication Crisis: Re-test 100 Studies]

  So, under this scenario, about a fourth of the failed replications (12 of 50) are explained by a lack of power in the replication efforts. These are real results that would likely be replicated successfully either if an additional replication study were done or if the original replication study had a higher sample size.
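  As a rough check on the "12 of 50" figure, the sketch below splits the expected failed replications by cause, using the same 100-study scenario. The function name and its return layout are illustrative assumptions.

```python
def failed_replication_breakdown(studies=100, true_positive_share=0.60,
                                 power=0.80, alpha=0.05):
    """Return (failed true positives, failed false positives, share due to low power)."""
    true_positives = studies * true_positive_share
    false_positives = studies - true_positives
    failed_true = true_positives * (1 - power)    # real effects the re-test missed
    failed_false = false_positives * (1 - alpha)  # false positives that did not recur
    share_low_power = failed_true / (failed_true + failed_false)
    return failed_true, failed_false, share_low_power


failed_true, failed_false, share = failed_replication_breakdown()
print(round(failed_true), round(failed_false), round(share, 2))  # 12 38 0.24
```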

  The rest of the results that failed to replicate should have never been positive results in the first place. Many of these original studies probably underestimated their type I error rate, increasing their chances of being a false positive. That’s because when a study is designed for a 5 percent chance of a false positive, that chance applies only to one statistical test, but very rarely is only one statistical test conducted.

  The act of running additional tests to look for statistically significant results has many names, including data dredging, fishing, and p-hacking (trying to hack your data looking for small enough p-values). Often this is done with the best of intentions, as seeing data from an experiment can be illuminating, spurring a researcher to form new hypotheses. The temptation to test these additional hypotheses is strong, since the data needed to analyze them has already been collected. The trouble comes in, though, when a researcher overstates results that arise from these additional tests.

  The XKCD cartoon on this page illustrates how data dredging can play out: when no statistically significant relationship was found between jelly beans and acne, the scientists proceeded to dredge through twenty-one subgroups until one with a sufficiently low p-value was found, resulting in the headline “Green Jelly Beans Linked to Acne!”

  Each time another statistical test was done, the chance of forming an erroneous conclusion continued to grow above 5 percent. To see this, suppose you had a twenty-sided die. The chances of making a mistake on the first test would be the same as the chances of rolling a one. Each additional test run would be another roll of the die, each with another one-in-twenty chance of rolling a one. After twenty-one rolls (matching the twenty-one jelly bean colors in the comic), there is about a two-thirds chance that a one is rolled at least once, i.e., that there was at least one erroneous result.
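  The die analogy corresponds to a one-line calculation. The sketch below assumes the tests are independent, each with a 5 percent false positive rate; the function name is illustrative.

```python
def chance_of_at_least_one_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests is a false positive."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 21):
    print(num_tests, round(chance_of_at_least_one_false_positive(num_tests), 3))
# The last line printed is "21 0.659", i.e., about two-thirds.
```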

  If this type of data dredging happens routinely enough, then you can see why a large number of studies in the set to be replicated might have been originally false positives. In other words, in this set of one hundred studies, the base rate of false positives is likely much larger than 5 percent, and so another large part of the replication crisis can likely be explained as a base rate fallacy.

  Unfortunately, studies are much, much more likely to be published if they show statistically significant results, which causes publication bias. Studies that fail to find statistically significant results are still scientifically meaningful, but both researchers and publications have a bias against them for a variety of reasons. For example, there are only so many pages in a publication, and given the choice, publications would rather publish studies with significant findings over ones with none. That’s because successful studies are more likely to attract attention from media and other researchers. Additionally, studies showing significant results are more likely to contribute to the careers of the researchers, where publication is often a requirement to advance.

  Therefore, there is a strong incentive to find significant results from experiments. In the cartoon, even though the original hypothesis didn’t show a significant result, the experiment was “salvaged” and eventually published because a secondary hypothesis was found that did show a significant result. The publication of false positives like this directly contributes to the replication crisis and can delay scientific progress by influencing future research toward these false hypotheses. And the fact that negative results aren’t always reported can also lead to different people testing the same negative hypotheses over and over again because no one knows other people have tried them.

  There are also many other reasons a study might not be replicable, including the various biases we’ve discussed in previous sections (e.g., selection bias, survivorship bias, etc.), which could have crept into the results. Another reason is that, by chance, the original study might have showcased a seemingly impressive effect, when in reality the effect is much more modest (regression to the mean). If this is the case, then the replication study probably does not have a large enough sample size (isn’t sufficiently powered) to detect the small effect, resulting in a failed replication of the study.

  There are ways to overcome these issues, such as the following:

  Using lower p-values to properly account for false positive error in the original study, across all the tests that are conducted

  Using a larger sample size in a replication study to be able to detect a smaller effect size

  Specifying statistical tests to run ahead of time to avoid p-hacking
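  As one illustration of the first item, a common way to lower the p-value threshold is the Bonferroni correction, which divides the overall false positive rate by the number of tests. The text does not name a specific method, so treat this choice as an assumption; the sketch also assumes independent tests.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test p-value threshold that keeps the overall error rate at or below alpha."""
    return alpha / num_tests


def familywise_error(per_test_alpha, num_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - per_test_alpha) ** num_tests


threshold = bonferroni_threshold(0.05, 21)  # twenty-one tests, as in the jelly bean example
print(round(threshold, 5))
print(round(familywise_error(threshold, 21), 4))  # stays just under 0.05
```

  The trade-off is that a stricter threshold makes real effects harder to detect, which is one reason the list also mentions larger sample sizes.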

  Nevertheless, as a result of the replication crisis and the reasons that underlie it, you should be skeptical of any isolated study, especially when you don’t know how the data was gathered and analyzed. More broadly, when you interpret a claim, it is important to evaluate critically any data that backs up that claim: Is it from an isolated study or is there a body of research behind the claim? If so, how were the studies designed? Have all biases been accounted for in the designs and analyses? And so on.

  Many times, this investigation will require some digging. Media sources can draw false conclusions and rarely provide the necessary details to allow you to understand the full design of an experiment and evaluate its quality, so you will usually need to consult the original scientific publication. Nearly all journals require a full section describing the statistical design of a study, but given the word constraints of a typical journal article, details are sometimes left out. Look for longer versions or related presentations on research websites. Researchers are also generally willing to answer questions about their research.

  In the ideal scenario, you would be able to find a body of research made up of many studies, which would eliminate doubts as to whether a certain result was a chance occurrence. If you are lucky, someone has already published a systematic review about your research question. Systematic reviews are an organized way to evaluate a research question using the whole body of research on a certain topic. They define a detailed and comprehensive (systematic) plan for reviewing study results in an area, including identifying and finding relevant studies in order to remove bias from the process.

  Some but not all systematic reviews include meta-analyses, which use statistical techniques to combine data from several studies into one analysis. The data-driven reporting site FiveThirtyEight is a good example; it conducts meta-analyses across polling data to better predict political outcomes.

  There are advantages to meta-analyses, as combining data from multiple studies can increase the precision and accuracy of estimates, but they also have their drawbacks. For example, it is problematic to combine data across studies where the designs or sample populations vary too much. They also cannot eliminate biases from the original studies themselves. Further, both systematic reviews and meta-analyses can be compromised by publication bias because they can include only results that are publicly available.
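  To make the precision gain concrete, here is a minimal sketch of one standard pooling technique, a fixed-effect inverse-variance average. The text does not say which technique any particular meta-analysis uses, so the method choice is an assumption, and the three (estimate, standard error) pairs are invented purely for illustration.

```python
import math

# Hypothetical (effect estimate, standard error) pairs from three studies.
studies = [(0.30, 0.15), (0.10, 0.10), (0.20, 0.20)]


def pooled_estimate(studies):
    """Fixed-effect pooling: weight each estimate by 1 / standard_error**2."""
    weights = [1 / se ** 2 for _, se in studies]
    total_weight = sum(weights)
    estimate = sum(w * est for w, (est, _) in zip(weights, studies)) / total_weight
    standard_error = math.sqrt(1 / total_weight)
    return estimate, standard_error


estimate, standard_error = pooled_estimate(studies)
print(round(estimate, 3), round(standard_error, 3))
```

  The pooled standard error comes out smaller than that of any single study, which is the precision gain described above. The calculation does nothing about the drawbacks listed: biased or mismatched inputs still give a biased pooled estimate.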

  Whenever we are looking at the validity of a claim, we first look to see whether a thorough systematic review has been conducted, and if so, we start there. After all, systematic reviews and meta-analyses are commonly used by policy makers in decision making, e.g., in developing medical guidelines.

  If one thing is clear from this chapter, it’s probably that designing good experiments is tough! We hope you’ve also gathered that probability and statistics are useful tools for better understanding problems that involve uncertainty. However, as this section should also make clear, statistics is not a magical cure for uncertainty. As statistician Andrew Gelman suggested in The American Statistician in 2016, we must “move toward a greater acceptance of uncertainty and embracing of variation.”

  More generally, keep in mind that while statistics can help you obtain confident predictions across a variety of circumstances, it cannot accurately predict what will occur in an individual event. For instance, you may know that the average summer day is sunny and warm at your favorite beach spot, but that is no guarantee that it won’t be rainy or unseasonably cool the week you plan to take off from work.

  Similarly, medical research tells you that your risk of getting lung cancer increases if you smoke, and while you can estimate, with a confidence interval, the probability that an average smoker will get lung cancer in their lifetime, probability and statistics can’t tell you what specifically will happen for any one individual smoker.

  While probability and statistics aren’t magic, they do help you better describe your confidence around the likelihood of various outcomes. There are certainly lots of pitfalls to watch out for, but we hope you also take away the fact that research and data are more useful for navigating uncertainty than hunches and opinions.

  KEY TAKEAWAYS

  Avoid succumbing to the gambler’s fallacy or the base rate fallacy.

  Anecdotal evidence and correlations you see in data are good hypothesis generators, but correlation does not imply causation—you still need to rely on well-designed experiments to draw strong conclusions.

  Look for tried-and-true experimental designs, such as randomized controlled experiments or A/B testing, that show statistical significance.

  The normal distribution is particularly useful in experimental analysis due to the central limit theorem. Recall that in a normal distribution, about 68 percent of values fall within one standard deviation of the mean, and about 95 percent within two.

  Any isolated experiment can result in a false positive or a false negative and can also be biased by myriad factors, most commonly selection bias, response bias, and survivorship bias.

  Replication increases confidence in results, so start by looking for a systematic review and/or meta-analysis when researching an area.

  Always keep in mind that when dealing with uncertainty, the values you see reported or calculate yourself are uncertain themselves, and that you should seek out and report values with error bars!

  6

  Decisions, Decisions

  IF YOU COULD KNOW HOW your decisions would turn out, decision making would be so easy! It is hard because you have to make decisions with imperfect information.

  Suppose you are thinking of making a career move. You have a variety of next steps to consider:

  You could look for the same job you’re doing now, though with some better attributes (compensation, location, mission of organization, etc.).

  You could try to move up the professional ladder at your current job.

  You could move to a similar organization at a higher position.

  You could switch careers altogether, starting by going back to school.

  There are certainly more options. When you dig into them all, the array of choices seems endless. And you won’t be able to try any of them out completely before you commit to one. Such is life.

  How do you make sense of it all? The go-to framework for most people in situations like this is the pro-con list, where you list all the positive things that could happen if the decision were made (the pros), weighing them against the negative things that could happen (the cons).

  While useful in some simple cases, this basic pro-con methodology has significant shortcomings. First, the list presumes there are only two options, when as you just saw there are usually many more. Second, it presents all pros and cons as if they had equal weight. Third, a pro-con list treats each item independently, whereas these factors are often interrelated. A fourth problem is that since the pros are often more obvious than the cons, this disparity can lead to a grass-is-greener mentality, causing you mentally to accentuate the positives (e.g., greener grass) and overlook the negatives.

  As an example, in 2000, Gabriel finished school and began a career as an entrepreneur. Early on, at times, he considered switching to a career in venture capital, where he would fund and support companies instead of starting his own. When he initially made a pro-con list, this seemed like a great idea. There were many pros (the chance to work with founders changing the world, the potential for extremely high compensation, the opportunity to work on startups in a high-leverage way without the risk and stress of being the founder, etc.) and no obvious cons.

  However, there were several cons that he just didn’t fully appreciate or just didn’t know about yet (the relentless socializing involved—not good for a major introvert—the burden of having to constantly say no to people, the difficulty of breaking into the field, the fact that much of your time is spent with struggling companies, etc.). While certainly a great career for some who get the opportunity, venture capital was not a good fit for Gabriel, even if he didn’t realize it at first. With more time and experience, the full picture has become clear (the grass isn’t greener, at least for him), and he has no plans to make that career change.

  This anecdote is meant to illustrate that it is inherently difficult to create a complete pro-con list when your experience is limited. Other mental models in this chapter will help you approach situations like these with more objectivity and skepticism, so you can uncover the complete picture faster and make sense of what to do about it.

  You’ve probably heard the phrase “If all you have is a hammer, everything looks like a nail.” This phrase is called Maslow’s hammer and is derived from this longer passage by psychologist Abraham Maslow in his 1966 book The Psychology of Science:

  I remember seeing an elaborate and complicated automatic washing machine for automobiles that did a beautiful job of washing them. But it could do only that, and everything else that got into its clutches was treated as if it were an automobile to be washed. I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.

  The hammer of decision-making models is the pro-con list: useful in some instances, but not the optimal tool for every decision. Luckily, there are other decision-making models to help you efficiently discover and evaluate your options and their consequences across a variety of situations. As some decisions are complex and consequential, they demand more complicated mental models. In simpler cases, applying these sophisticated models would be overkill. It is best, however, to be aware of the range of mental models available so that you can pick the right tool for any situation.

  WEIGHING THE COSTS AND BENEFITS

  One simple approach to improving the pro-con list is to add some numbers to it. Go through each of your pros and cons and put a score of −10 to 10 next to it, indicating how much that item is worth to you relative to the others (negatives for cons and positives for pros). When considering a new job, perhaps location is much more important to you than a salary adjustment? If so, location would get a higher score.

  Scoring in this way helps you overcome some of the pro-con list deficiencies. Now each item isn’t treated equally anymore. You can also group multiple items together into one score if they are interrelated. And you can now more easily compare multiple options: simply add up all the pros and cons for each option (e.g., job offers) and see which one comes out on top.
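  The scoring approach is easy to sketch in a few lines of Python. The two job offers, their items, and every score below are hypothetical, chosen only to show the mechanics.

```python
# Hypothetical scores from -10 to 10 (negative for cons, positive for pros).
options = {
    "Job offer A": {"location": 8, "salary adjustment": 3, "longer commute": -4},
    "Job offer B": {"location": 2, "salary adjustment": 7, "longer commute": -1},
}


def total_score(items):
    """Add up the pro and con scores for a single option."""
    return sum(items.values())


def best_option(options):
    """Return the name of the option with the highest total score."""
    return max(options, key=lambda name: total_score(options[name]))


for name, items in options.items():
    print(name, total_score(items))  # Job offer A 7, then Job offer B 8
print(best_option(options))          # Job offer B
```

  Interrelated items can be merged into a single entry with one combined score, as the text suggests.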

  This method is a simple type of cost-benefit analysis, a natural extension of the pro-con list that works well as a drop-in replacement in many situations. This powerful mental model helps you more systematically and quantitatively analyze the benefits (pros) and costs (cons) across an array of options.

  For simple situations, the scoring approach just outlined works well. In the rest of this section, we explain how to think about cost-benefit analysis in more complicated situations, introducing a few other mental models you will need to do so. Even if you don’t use sophisticated cost-benefit analysis yourself, you will want to understand how it works because this method is often used by governments and organizations to make critical decisions. (Math warning: because numbers are involved, there is a bit of arithmetic needed.)

  The first change when you get more sophisticated is that instead of putting relative scores next to each item (e.g., −10 to 10), you start by putting explicit dollar values next to them (e.g., −$100, +$5,000, etc.). Now when you add up the costs and benefits, you will end up with an estimate of that option’s worth to you in dollars.

  For example, when considering the option of buying a house, you would start by writing down what you would need to pay out now (your down payment, inspection, closing costs), what you would expect to pay over time (your mortgage payments, home improvements, taxes . . . the list goes on), and what you expect to get back when you sell the house. When you add those together, you can estimate how much you stand to gain (or lose) in the long term.
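  A minimal sketch of the house example follows. Every dollar amount is a made-up placeholder, and the calculation simply adds the figures together as described above, with no adjustment for when each payment occurs.

```python
# Hypothetical dollar values: costs are negative, benefits are positive.
upfront = {"down payment": -60_000, "inspection": -500, "closing costs": -9_000}
over_time = {"mortgage payments": -240_000, "home improvements": -30_000, "taxes": -50_000}
at_sale = {"sale proceeds": 420_000}


def net_value(*groups):
    """Add up every cost and benefit across the given groups of line items."""
    return sum(amount for group in groups for amount in group.values())


print(net_value(upfront, over_time, at_sale))  # 30500
```

  A positive total suggests a long-term gain under these assumed numbers and a negative total a loss. Changing any line item shows how much the conclusion depends on that estimate.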

 
