The Rules of Contagion

Page 18

by Adam Kucharski

The same situation arises in the biological world. Many species have to adapt simply to keep pace with their competitors. After humans came up with antibiotics to treat bacterial infections, some bacteria evolved to become resistant to common drugs. In response, we turned to even stronger antibiotics. This put pressure on bacteria to evolve further. Treatments gradually became more extreme, just to have the same impact as lesser drugs did decades earlier.[38] In biology, this arms race is known as the ‘Red Queen effect’, after the character in Lewis Carroll’s Through the Looking-Glass. When Alice complains that running in the looking-glass world doesn’t take her anywhere new, the Red Queen replies that, ‘here, you see, it takes all the running you can do, to keep in the same place.’

This evolutionary running is about change, but it’s also about transmission. Even if a new mutation crops up in bacteria, it won’t automatically spread through a human population. Likewise, if new content emerges online, it’s not a guarantee it will become popular. We all know of new stories and ideas that have spread widely online, but we also know of posts – perhaps including our own – that have fizzled away without notice. So how common is popularity online? What does a typical outbreak even look like?

The rumours about the Higgs boson spread gradually at first. On 1 July 2012, Twitter users started speculating that the elusive particle – nicknamed the ‘God particle’ – had finally been discovered. Originally suggested by Peter Higgs in 1964, the boson was a crucial missing piece in the subatomic jigsaw. The laws of particle physics said it should exist, but it was yet to be observed in reality.

That would soon change. The rumours on Twitter initially claimed that physicists had discovered the boson at the Tevatron particle accelerator in Illinois. The rumour outbreak grew at a rate of about one new user per minute during this period. The next day, researchers at the Tevatron announced that they’d found promising – but not quite definitive – evidence that the Higgs boson existed. The Twitter outbreak accelerated, with more and more users joining, and attention turned to the Large Hadron Collider at CERN. These latest rumours would prove true: two days later, CERN researchers announced they had indeed found the boson. As media interest in the discovery grew, more joined the Twitter outbreak. It grew by over five hundred users per minute for the next day or so, before peaking soon after. By 6 July, five days after the first rumour emerged, interest in the story had declined dramatically.[39]

When the Higgs rumours started, some users posted about the potential discovery, while others retweeted these comments to their own followers. If we look at how the first few hundred of these retweets were connected, there is a huge amount of variation in transmission (see figure on next page). Most tweets didn’t go very far, only spreading the news to one or two others. But in the middle of the transmission network, there is a large chain of retweets, including two large-scale transmission events, with single users spreading the rumour to many other people.

This sort of diversity in transmission is common in online sharing. In 2016, Duncan Watts, then based at Microsoft Research, worked with collaborators at Stanford University to look at ‘cascades’ of sharing on Twitter. The group tracked over 620 million pieces of content, noting which users had reposted links shared by others. Some links passed between multiple users in a long chain of transmission. Others sparked but faded away much faster. Some didn’t spread at all.[40]

Initial retweets about the Higgs boson rumour, 1 July 2012. Each dot represents a user, with lines showing retweets

Data: De Domenico et al., 2013

For infectious diseases, we’ve seen there are two extreme types of outbreaks. ‘Common source’ transmission occurs when everyone gets infected from the same source, like food poisoning. At the other extreme, a propagated outbreak spreads from person to person over several generations. There is a similar diversity in online cascades. Sometimes content will spread to lots of people from a single source – known in marketing as a ‘broadcast’ event – whereas on other occasions it will propagate from user to user. The Stanford and Microsoft researchers found that broadcasts were a crucial part of the largest cascades. About one in a thousand Twitter posts got more than 100 shares, but only a fraction of these spread because of propagated transmission. Of the posts that spread, there was generally a single broadcast event behind their success.

When we talk about online contagion, it’s tempting to focus only on things that have become popular. However, this ignores the fact that the vast majority of things do not take off. The Microsoft team found that around 95 per cent of Twitter cascades consisted of a single tweet that nobody else shared. Of the remaining cascades, most didn’t go any further than one additional step in terms of sharing. The same is true of other online platforms: it’s extremely rare to get something that spreads, and even when it does, it doesn’t spread beyond a few generations of transmission. Most content just isn’t that contagious.[41]

In the previous chapter, we looked at outbreaks of shootings in Chicago, where transmission generally ended after a small number of events. Several diseases also stumble and stutter in human populations like this. For example, strains of bird flu like H5N1 and H7N9 have caused large outbreaks in poultry, but don’t spread well among people (at least, not for the moment).

What sort of outbreaks should we expect if something doesn’t spread very effectively? We’ve already looked at how we can use the reproduction number, R, to assess whether an infectious disease has the potential to spread or not; if R is above the critical value of one, there is potential for a large epidemic to occur. But even if R is below one, there’s still a chance an infected person will pass the disease on to someone else. It might be unlikely, but it’s possible. Unless the reproduction number is zero, we should therefore expect to get some secondary cases occasionally. These new cases may generate further generations of infection before the outbreak eventually stutters to an end.
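To make this stuttering concrete, here is a minimal simulation sketch. It assumes that each case infects a Poisson-distributed number of other people with mean R; that is a modelling choice made for this illustration, not something the text specifies, and the function names are likewise hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_outbreak(R, rng, max_cases=100_000):
    """Return the total number of cases in one outbreak started by a single case.

    Assumption: secondary cases per person are Poisson with mean R.
    max_cases is only a safety cap in case R is set to one or above.
    """
    total = 1
    infectious = 1
    while infectious > 0 and total < max_cases:
        # Every currently infectious person generates their own new cases.
        new_cases = sum(poisson(rng, R) for _ in range(infectious))
        total += new_cases
        infectious = new_cases
    return total


rng = random.Random(1)  # fixed seed so the run is repeatable
sizes = [simulate_outbreak(0.8, rng) for _ in range(20_000)]

no_spread = sum(1 for s in sizes if s == 1) / len(sizes)
print(f"share of outbreaks with no onward transmission: {no_spread:.2f}")
print(f"largest simulated outbreak: {max(sizes)} cases")
```

Under these assumptions, a large share of outbreaks end with the first case, while a handful stutter on for many generations before fading out.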

If we know the reproduction number of a stuttering infection, can we predict how big an outbreak will be on average? It turns out that we can, thanks to a handy piece of mathematics. As well as becoming a crucial part of outbreak analysis, it’s an idea that would shape how Jonah Peretti and Duncan Watts approached viral marketing in the early days of BuzzFeed.[42]

Suppose an outbreak starts with one infectious person. By definition, this first case will generate R secondary cases on average. Then these new infections will generate R more cases each – which translates into R² new cases – and so on:

Outbreak size = 1 + R + R² + R³ + …

We could try and add up all these values to work out the expected outbreak size. But fortunately there’s an easier option. In the nineteenth century, mathematicians proved that there’s an elegant rule we can apply to sequences like the one above. If R is between 0 and 1, the following equation is true:

1 + R + R² + R³ + … = 1/(1–R)

In other words, as long as the reproduction number is below 1, the expected outbreak size is equal to 1/(1–R). Even if you’re not especially interested in nineteenth-century mathematics, it’s worth taking a moment to appreciate how useful this shortcut is. Rather than having to simulate how an infection might stutter along from one generation to the next until it eventually fizzles out, we can instead estimate the final outbreak size directly from the reproduction number.[43] If R is 0.8, for example, we’d expect an outbreak with 1/(1–0.8) = 5 cases in total. And that’s not all we can do. We can also work backwards to estimate the reproduction number from the average outbreak size. If outbreaks consist of five cases on average, it means R is 0.8.
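The shortcut and its reverse can be written as a short sketch. The function names here are illustrative, and the `partial_sum` helper is included only to show the generation-by-generation total creeping towards the shortcut value.

```python
def expected_outbreak_size(R):
    """Expected total cases from one initial case: 1/(1 - R), valid for 0 <= R < 1."""
    if not 0 <= R < 1:
        raise ValueError("the shortcut only holds when R is at least 0 and below 1")
    return 1 / (1 - R)


def reproduction_number_from_size(mean_size):
    """Work backwards: R = 1 - 1/(average outbreak size)."""
    if mean_size < 1:
        raise ValueError("an outbreak contains at least one case")
    return 1 - 1 / mean_size


def partial_sum(R, generations):
    """Add up 1 + R + R**2 + ... up to the given number of generations."""
    return sum(R ** g for g in range(generations + 1))


print(f"{expected_outbreak_size(0.8):.2f}")       # shortcut value for R = 0.8
print(f"{reproduction_number_from_size(5):.2f}")  # working backwards from 5 cases
for generations in (5, 20, 50):
    print(generations, f"{partial_sum(0.8, generations):.4f}")
```

With R set to 0.8, the partial sums approach five cases as more generations are included, which is the value the shortcut gives in a single step.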

In my field, we regularly use this back-of-the-envelope calculation to estimate the reproduction number of new disease threats. During the early months of 2013, there were 130 human cases of H7N9 bird flu in China. Although most of these picked up the disease from contact with poultry, there were four clusters of infection that were likely to be the result of transmission between humans.[44] Because most people didn’t infect anyone else, the average size of a human H7N9 outbreak was 1.04 cases, suggesting that R in humans was a paltry 0.04.

This idea isn’t only useful for diseases. During the mid-2000s, Jonah Peretti and Duncan Watts applied the same method to marketing campaigns. It meant they could get at the underlying transmissibility of an idea, rather than just describing what a campaign had looked like. In 2004, for example, anti-gun-violence group The Brady Campaign had sent out e-mails asking people to support new gun control measures. They encouraged recipients to forward the e-mails to their friends; some of these friends then forwarded the messages to their friends, and so on. For each e-mail that was sent, on average around 2.4 people ended up seeing the message. Based on this typical outbreak size, the reproduction number of the campaign was about 0.58. A subsequent e-mail campaign aimed to raise money for Hurricane Katrina relief efforts; this time R was 0.77. However, there wasn’t always so much transmission. Spare a thought for the marketing executives trying to spread messages about cleaning products: Peretti and Watts found that e-mails promoting Tide Coldwater detergent had an R of only 0.04 (i.e. the same as H7N9 bird flu). Whereas most of the Katrina e-mails had spread between multiple people, over 99 per cent of the Tide outbreaks stuttered to an end after only one transmission event.[45]
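As a rough check on these campaign figures, the same two formulas can be applied directly. The sketch below treats the 2.4 people per Brady e-mail as the average outbreak size; the outbreak sizes it prints for the Katrina and Tide campaigns are what the shortcut implies from the reported R values, not numbers quoted by Peretti and Watts. Function names are illustrative.

```python
def reproduction_number_from_size(mean_size):
    """R = 1 - 1/(average outbreak size)."""
    return 1 - 1 / mean_size


def expected_outbreak_size(R):
    """Average outbreak size = 1/(1 - R), valid for R below 1."""
    return 1 / (1 - R)


# Brady Campaign: about 2.4 people saw the message per e-mail sent.
brady_R = reproduction_number_from_size(2.4)

# Katrina and Tide: R is reported, so work forwards to the implied size.
katrina_size = expected_outbreak_size(0.77)
tide_size = expected_outbreak_size(0.04)

print(f"Brady Campaign R: {brady_R:.2f}")
print(f"Katrina implied average outbreak size: {katrina_size:.2f}")
print(f"Tide implied average outbreak size: {tide_size:.2f}")
```

The Brady result comes out at roughly 0.58, matching the figure above, while the Tide campaign's implied average of about 1.04 people per e-mail mirrors the H7N9 cluster size.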

Why do we care about measuring an infection if it won’t lead to a large outbreak? For biological pathogens, a big concern is that these infections will adapt to their new hosts. During a small outbreak, viruses could pick up mutations that enable them to transmit more easily. The more people that get infected, the more chances for such adaptation. Before SARS sparked a major outbreak in Hong Kong in February 2003, there were a series of small clusters of infection in Guangdong province, in southern China.[46] Between November 2002 and January 2003, seven outbreaks were reported in Guangdong, with between one and nine cases in each. The average outbreak size was five cases, suggesting that R may have been around 0.8 during this period. But by the time of the Hong Kong outbreak a couple of months later, SARS had a far more troubling R of more than 2.

There are several reasons the reproduction number of an infection may increase. Recall that R depends on the four DOTS: duration of infection, opportunities for transmission, transmission probability during each opportunity, and average susceptibility. For biological viruses, all of these features can influence transmission. Of the viruses that can spread among humans, the most successful tend to cause longer infections (i.e. larger duration) and spread directly from one person to another rather than via an intermediate source (i.e. more opportunities).[47] Transmission probability can also make a difference: bird flu viruses struggle to spread among people because they can’t latch onto the cells in our airway as easily as human viruses can.[48]

The same sort of adaptation can happen with online content. There are many examples of online memes – such as posts and images – evolving to increase their catchiness. When Facebook researcher Lada Adamic and her colleagues analysed the spread of memes on the social network, they noticed that content would often change over time.[49] One example was a post that read: ‘No one should die because they cannot afford health care and no one should go broke because they get sick.’ In its original form, the meme was shared almost half a million times. But variants soon emerged, with one in every ten posts adding a mutation to the wording. Some of these edits helped the meme propagate; when people included phrases like ‘post if you agree’, the meme was almost twice as likely to spread. The meme was also highly resilient. After an initial peak in popularity, it persisted in one form or another for at least two years.

Even so, there seems to be a limit to the potential contagiousness of online content. The most popular trends on Facebook during 2014–2016 all had a reproduction number of around 2. This limit seems to occur because the different components of transmission trade off against each other. Some trends – like the ice bucket challenge – involved only a few nominations per person, but came with a high probability of transmission during each nomination. Other content, such as videos and links, had far more opportunities to spread, but in reality only a few friends who saw the post reshared it.[50] Remarkably, there were no examples of Facebook content that reached lots of friends and had a consistently high probability of spreading to each person that saw it. This serves as a reminder of just how weak online outbreaks are compared to biological infections: even the most popular content on Facebook is ten times less contagious than measles can be.
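This trade-off can be sketched by treating the reproduction number as the number of opportunities multiplied by the probability of transmission at each one. That simplification, and every number below, is a hypothetical choice for illustration only; none of the figures come from the Facebook data.

```python
def reproduction_number(opportunities, transmission_probability):
    """Simplified R: people exposed per sharer times the chance each one reshares."""
    return opportunities * transmission_probability


# Hypothetical nomination-style trend: few opportunities, high probability.
challenge_R = reproduction_number(opportunities=3, transmission_probability=0.67)

# Hypothetical video or link: many viewers, but few of them reshare.
video_R = reproduction_number(opportunities=200, transmission_probability=0.01)

print(f"nomination-style trend: R is about {challenge_R:.1f}")
print(f"widely seen video: R is about {video_R:.1f}")
```

Two very different sharing patterns end up with much the same reproduction number, because a gain in one component is offset by a loss in the other.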

The outlook is even worse for a typical marketing campaign. Although Jonah Peretti once bet that it was possible to get something to deliberately take off, he’s since acknowledged that it’s much harder to guarantee contagion when working to a client brief.[51] Consider the difference between his original Nike e-mail, which spread widely, and those later e-mail campaigns, which were far less transmissible. Peretti and Watts have pointed out that infectious diseases have millennia of evolution on their side; marketers don’t have nearly as much time. ‘The chances are, therefore, that even talented creatives will typically design products that exhibit R less than 1, no matter how hard they try,’ they suggested.[52]

Fortunately, there is another way to increase the size of an outbreak: get the message out to more people at the start. In the above examples, we’ve been analysing stuttering outbreaks by assuming that one person is infectious at the start. If the reproduction number is small, this will lead to a small outbreak that fades away quickly. One way to fix this is to simply introduce more infections. Peretti and Watts call it ‘big seed marketing’. If we get a slightly contagious message to lots of people, it can pick up additional attention during subsequent small outbreaks. For example, if we send a non-contagious message to one thousand people, we’ll reach one thousand people. If instead we send those thousand people a message with an R of 0.8, we’d expect to reach five thousand people in total. Much of BuzzFeed’s early content became popular in this way. People saw articles on the website, then shared them with a handful of friends on sites like Facebook. Having pioneered the idea of ‘reblogging’ in the early 2000s, Peretti’s team took full advantage of it in the decade that followed. By 2013, BuzzFeed had been named the most ‘social’ publisher on Facebook, with more comments, likes, and shares than any other organisation.[53] (Huffington Post, Peretti’s former company, was second.)
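The big seed arithmetic follows from the outbreak-size shortcut: each initial recipient starts their own small outbreak of expected size 1/(1–R), so the expected total is the number of seeds multiplied by that figure. A minimal sketch, with an illustrative function name:

```python
def expected_total_reach(seeds, R):
    """Expected audience when `seeds` people each start an outbreak with R below 1."""
    if not 0 <= R < 1:
        raise ValueError("the shortcut only holds when R is at least 0 and below 1")
    return seeds / (1 - R)


for R in (0.0, 0.5, 0.8):
    total = expected_total_reach(1000, R)
    extra = total - 1000  # people reached through sharing rather than directly
    print(f"R = {R}: about {total:.0f} reached, {extra:.0f} of them via sharing")
```

With a thousand seeds, a message that nobody shares reaches only those thousand, whereas one with an R of 0.8 picks up around four thousand extra people through sharing.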

If web content generally has a low R and needs multiple introductions to spread, it suggests that we shouldn’t be thinking about online contagion as if it’s the 1918 flu virus or SARS. Infections like pandemic flu spread easily from person to person, which means outbreaks initially grow larger and larger over several generations of transmission. In contrast, most online content won’t reach many people unless there is some kind of mass broadcast event. According to Peretti, marketing companies will often talk about things going ‘viral’ like a disease, but they actually just mean something has become popular. ‘We were thinking in terms of an actual epidemiological definition of viral, with a certain threshold of contagion that results in it growing through time,’ as he once put it.[54] ‘Instead of exponential decay, you get exponential growth. That is what viral is.’

Most online cascades are not viral like pandemics are; they do not grow exponentially. They are actually more like the stuttering smallpox outbreaks that occurred in Europe during the 1970s. These outbreaks would generally fade away, albeit with the occasional superspreading event leading to a large cluster of cases. Yet the smallpox superspreader analogy only goes so far, because media outlets and celebrities have a reach far beyond what’s possible for biological transmission. ‘A superspreader is someone who infects, like, eleven people instead of two,’ Watts said. ‘You don’t have superspreaders who infect eleven million people.’

Given that social media cascades aren’t the same as infectious disease outbreaks, a traditional disease model won’t necessarily help us predict what will happen online. But maybe we don’t need to rely on biologically inspired predictions. Given the sheer volume of data generated on social media, researchers are increasingly trying to identify transmission patterns, and use these to predict the dynamics of cascades.

How easy is it to predict online popularity? In 2016, Watts and his colleagues at Microsoft Research compiled data on almost a billion Twitter cascades.[55] They gathered data on the tweets themselves – such as the time posted and topic – as well as information about the users who initially tweeted them, such as their number of followers and whether they had a history of getting a lot of retweets. Analysing the resulting cascade sizes, they found that the content of the tweet itself provides very little information about whether it would be popular. As with their earlier analysis of influencers, the team found that a user’s past tweeting success was far more important. Even so, their overall prediction ability was fairly limited. Despite having the sort of dataset a disease researcher could only dream of, the team could explain less than half the variability in cascade size.

So what explained the other half? The researchers acknowledged that there might be some additional, as-yet-unknown features of success that could improve prediction ability. However, a large amount of the variation in popularity will depend on randomness. Even if we have detailed data about what is being tweeted and who is tweeting it, the success of a single post will inevitably depend a lot on luck. Again, this shows why it is important to spark multiple cascades, rather than trying to find a single ‘perfect’ tweet.

Because it’s so difficult to predict a tweet’s popularity before it’s been posted, an alternative is to wait and look at the start of the cascade before making a prediction. This is known as the ‘peeking method’, because we’re looking at data on the early spread before we predict what will happen next.[56] When Justin Cheng and his colleagues analysed sharing of photos on Facebook in 2014, they found that their predictions got much better once they had some data on the initial cascade dynamics. Large cascades tended to show broadcast-like spread early on, picking up lots of attention quickly. Yet the team found that some features were more elusive, even with a peeking method. ‘Predicting cascade size is still much easier than predicting cascade shape,’ they noted.[57]
