24 October 2016
Google Flu Trends uses big data to make helpful but not totally reliable predictions of the spread of flu. Photo:

Big data, big promise?

The past is a foreign country.

So is the future.

And an integral part of the future is big data.

Big data is defined by technology research firm Forrester Research as “techniques and technologies that make capturing value from data at extreme scale economical”.

It is made possible by the fact that vast amounts of data are created and collected every day on PCs, laptops and smartphones, both intentionally and unintentionally.

Companies are keen to tap into the sea of data and make something out of it.

One classic example is Google Flu Trends.

Created with the aim of tracking the spread of influenza across the United States, the program delivered results more quickly than the US Centers for Disease Control and Prevention did in 2009.

The trick, interestingly, was to find a correlation between what people searched for online and whether they had flu symptoms.

What is exciting about big data is that it is theory-free: we do not know why it works, but the sheer abundance of the data makes it work.

Enthusiasts hope this data-driven approach will eventually displace intuition and gut feelings to produce more accurate predictions and improve decision making.

That will be good for business.

McKinsey Global Institute argues that big data in manufacturing could drive a 50 per cent fall in the costs of product development and assembly.

Big data could also enable businesses to hire the right people and provide the right training – simply because the firms have the right data.

Governments may benefit from big data, too.

In Stockholm, for instance, authorities were able to cut road travel times in half by analysing signals from the global positioning system.

This is a shot in the arm for the development of “smart cities” all across the globe.

There are two reasons, however, why we should be cautious about overhyping the advent of big data.

The first issue is privacy.

No one would like to see his or her name, age, email address and preferences become commodities to be traded, without the owner’s consent, by companies that want to tap into the enormous pool of such data.

One way to tackle this is to regulate the use of the data.

Another is to make sure only aggregated data, not individual data, is accessible.

The second, more fundamental issue concerns the theoretical underpinning, or lack thereof, for big data.

A quantity-based approach is inherently fragile.

David Spiegelhalter, Winton professor of the public understanding of risk at Cambridge University, argues: “There are a lot of ‘small data’ problems that occur in big data.

“They don’t disappear because you’ve got lots of the stuff.

“They get worse.”

One of the problems is that the sample is never exhaustive: some people simply don’t surf the net.

In a similar vein, big data can lead to misleading conclusions and end up imposing costs on businesses.

As Tim Harford, author of The Undercover Economist, put it, “if you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down”.

One classic example to illustrate these flaws is, ironically, Google Flu Trends.

It was found to have persistently overestimated the spread of flu by almost a factor of two.

To add insult to injury, simple statistical models that predicted the future based on past data produced far more precise results than Google Flu Trends.
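What such a simple model might look like can be sketched in a few lines of Python. This is a minimal illustration, not the model used in any study: the function names, the two-week lag and the weekly figures are all made up for the example, and the "search-based" series is simply the true value doubled to mimic an overestimate of roughly a factor of two.

```python
def lagged_forecast(observed, lag=2):
    """Predict each week's value as the value observed `lag` weeks earlier."""
    return [observed[i - lag] for i in range(lag, len(observed))]


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predictions and what actually happened."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up weekly flu rates (per cent of doctor visits), for illustration only.
observed = [1.0, 1.2, 1.5, 1.9, 2.4, 2.8, 3.0, 2.9]
lag = 2  # assumed reporting delay, in weeks

actual = observed[lag:]
baseline = lagged_forecast(observed, lag)

# Hypothetical search-based estimate that runs at double the true rate.
search_based = [2 * a for a in actual]

print("baseline error:    ", mean_absolute_error(baseline, actual))
print("search-based error:", mean_absolute_error(search_based, actual))
```

On these invented numbers the naive baseline, which merely repeats what was seen two weeks earlier, lands much closer to the truth than the estimate that is persistently twice too high.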

This does not mean we should reject big data altogether, but that we should handle data holistically: capitalising on the unprecedented sample size (or even the whole population) on the one hand, while upholding the standards of validity expected of small data on the other.

Here, Amazon is exemplary.

By making use of the search history of shoppers, it devised tools that match recommended books with shoppers’ preferences – this is something brick-and-mortar bookstores fail to offer.

Every time the shopper clicks (or does not click) on the recommendation, Amazon collects even more information and refines its search facility.

This allows it to understand not only correlation but also, to some extent, causation.

Incoherent as it may seem, the imperfect combination of big data and small data is probably the best bet for the time being.

It will at least make sure that, when we enter the foreign country known as the future, we can say with some degree of confidence how likely we are to meet someone with influenza.



EJ Insight contributor
