Big Data: Summary and Review

by Viktor Mayer-Schönberger and Kenneth Cukier

Has Big Data by Viktor Mayer-Schönberger and Kenneth Cukier been sitting on your reading list? Pick up the key ideas in the book with this quick summary.

Prior to the advent of computers, collecting and recording information was an arduous and time-consuming task. To put this in context, consider the information needed to complete a census of the population. Under the US Constitution, a census is required every decade, yet the 1880 census took over eight years to complete and publish. This meant the information had become obsolete before it was even made available.

But that was then. Now – with the invention of computers, digitization and the Internet – the picture has changed considerably. Information can be collected passively (or with much less effort) and at greater speeds, and the cost of storage is increasingly economical. This has brought us to the advent of the big-data era.

Although there is no formal definition, “big data” refers to both the data being captured on a much greater scale than previously possible, and the opportunities that data-sets of this size offer in terms of valuable insights discovered through analysis.

In 2009, Google provided a great example of the possibilities of big data when they published a research paper showing how they could analyze users’ search terms to predict the outbreak of flu and monitor its spread. They compared historical search-term data with data on the spread of flu in time and space from 2007 and 2008, and discovered 45 search terms that could be used in a formula to predict the spread of flu – a prediction which correlated strongly with official figures.

Only weeks after the paper was published, the outbreak of a deadly new strain of flu, H1N1, hit the headlines. Google’s system was pressed into action and provided indicators that proved more useful and timely than government statistics in delivering valuable information to public health officials.
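The underlying approach is easy to sketch. Here is a minimal, hypothetical illustration in Python (all numbers invented, and far simpler than Google’s actual model): score candidate search terms by how strongly their weekly volumes correlate with official case counts, keeping only the strong predictors to feed a forecasting formula.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)

# Invented weekly flu case counts from official statistics.
flu_cases = [120, 180, 260, 400, 650, 900, 700, 450, 300, 200]

# Invented weekly search volumes for a few candidate terms: two that
# track the outbreak, one that does not.
search_volumes = {
    "fever remedies": [c * 3 + random.gauss(0, 80) for c in flu_cases],
    "flu symptoms":   [c * 5 + random.gauss(0, 120) for c in flu_cases],
    "basketball":     [random.gauss(1000, 150) for _ in flu_cases],
}

# Keep only terms whose volume correlates strongly with official counts.
for term, volume in search_volumes.items():
    r = pearson(volume, flu_cases)
    verdict = "keep" if r > 0.9 else "drop"
    print(f"{term:15s} r = {r:+.3f} -> {verdict}")
```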

Big data provides insights we could not discover by analyzing data on a smaller scale.

Big Data Key Idea #1: Data is increasingly being collected and put to use in all aspects of our lives, from the size of our bums to the way we walk.

With the rise of Internet companies such as Facebook and Twitter, and the popularity of smart devices, we have become familiar with things such as our relationship statuses, comments, preferences and location being stored as data that can then be analyzed. This trend is part of the process of datafication – capturing information about the world in the form of data.

Because we can discover valuable insights from such data, we are likely to see the trend continue, with innovations in capturing data from sources we had not previously thought of as information.

An example of this trend can be seen at Japan’s Advanced Institute of Industrial Technology, where pressure sensors are used to measure the distribution of weight our backsides put on a car seat. The research has revealed that individuals can be so accurately identified by this information that weight distribution can be used as a security device, with the car starting only for drivers it “recognizes.”

Other companies too have realized the potential in datafication. Apple applied for a patent in 2009 to passively measure the blood oxygenation, heart rate and body temperature of users through the company’s earbuds. In a similar move, IBM was awarded a patent in 2012 for touch-sensitive floor surfaces, which have the potential to identify where and how different people are moving across them.

As these examples show, researchers are already harnessing sources of information we hadn’t previously considered as data. They aim to discover valuable insights into the ways we interact and behave, with an eye on creating innovative new products.

Data is increasingly being collected and put to use in all aspects of our lives, from the size of our bums to the way we walk.

Big Data Key Idea #2: Big data frees us from the limitations of using small samples of data to represent whole populations.

Before our current technological age of the Internet and computing, information was much harder to collect and record. Accordingly, we could collect only very limited amounts of information and then try to interpret them as best we could.

For example, say you wanted to conduct a telephone survey of voters for an upcoming local election. Clearly, it would be impossible to contact the entire population, so you call a few hundred people and assume that their answers reflect the whole population’s opinions. This approach is called sampling: you take a sample of all the data, and hope it is representative of the whole.

But what if a journalist approached you after you had conducted the survey and asked you to predict the votes of a specific segment of the population, for example, the public servants?

On looking through your data, you find that you have surveyed only ten such people, and therefore can’t make very reliable predictions.

You are then asked about an even more specific subgroup, say, public servants under the age of 30. This time you have queried only one such person, and hence you can’t make any predictions at all.

This is the inherent problem with sampling: when you begin to examine smaller and smaller subgroups of data, you will quickly find that you have insufficient observations to draw any meaningful conclusions.

In a big-data world, information is much easier to collect because we have access to much more of it, or in some cases all of it. This is why in a big-data version of your election survey, you would probably have information on the voting preferences of tens of thousands of people, possibly even everyone in your town. This would make it possible to “zoom in” on subgroups in the data almost endlessly.
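A small simulation makes the contrast concrete. This sketch (with an invented town and invented subgroup shares) draws a traditional 500-person sample and shows how quickly subgroup counts collapse compared with the full data:

```python
import random

random.seed(42)

# Invented town of 50,000 voters, each with an occupation and an age.
population = [
    {"occupation": random.choices(
        ["private sector", "public servant", "other"],
        weights=[78, 2, 20])[0],
     "age": random.randint(18, 80)}
    for _ in range(50_000)
]

# A traditional phone survey reaches only a few hundred people.
sample = random.sample(population, 500)

def tally(group, pred):
    return sum(1 for person in group if pred(person))

for label, data in [("sample of 500", sample), ("full data", population)]:
    servants = tally(data, lambda p: p["occupation"] == "public servant")
    young = tally(data, lambda p: p["occupation"] == "public servant"
                  and p["age"] < 30)
    print(f"{label:14s}: public servants = {servants:5d}, "
          f"of them under 30 = {young:4d}")
```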

Big data frees us from the limitations of using small samples of data to represent whole populations.

Big Data Key Idea #3: Vast sets of messier data can be more useful than smaller, more accurate ones.

While trying to develop a language-translation program in the 1980s, the engineers at IBM had a novel idea. They decided to dispense with the standard method of using grammar rules and dictionaries and instead allowed the computer to rely on statistical probabilities to calculate which word or phrase was called for, based on samples of translated text they fed into it.

IBM’s engineers decided to use a large but limited sample of high-quality data: three million sentence pairs from official translations of Canadian parliamentary documents. Despite promising early results, the project failed. Although the system could provide reliable translations for the most frequently used words and phrases, it was less reliable for those that occurred infrequently. The problem was not the quality of the data but the quantity – there was simply not enough of it.

When we have only a small proportion of the data, inaccuracies can be a big problem, especially when we want to look at results that occur infrequently. But as we move to having significantly higher proportions of data, inaccuracies have a much smaller effect on the results.

Less than a decade after IBM’s failed attempt, Google decided to tackle the translation issue with a slightly different approach. They decided to use a much bigger data-set of questionable quality: the entire global Internet. Their system scoured the web and used any translation it could find, amounting to billions of pages of text. Despite the dubious quality of the input, the sheer volume of data made the system’s translations more accurate than those of any rival system.
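The statistical idea behind both systems can be sketched in a few lines: estimate translation probabilities purely by counting which words co-occur in aligned sentence pairs. This toy version (three invented French–English pairs) is far cruder than either IBM’s or Google’s models, but it shows the counting-over-rules principle:

```python
from collections import Counter, defaultdict

# Three invented aligned sentence pairs (French -> English), standing in
# for the millions of pairs a real system would ingest.
corpus = [
    ("la maison", "the house"),
    ("la maison bleue", "the blue house"),
    ("la fleur", "the flower"),
]

# Count how often each source word co-occurs with each target word.
cooc = defaultdict(Counter)
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

# Turn raw counts into rough translation probabilities p(target | source).
# Naive co-occurrence over-credits frequent words like "the"; real
# alignment models correct for this, and more data sharpens the estimates.
for s, counts in cooc.items():
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    print(f"best guess for '{s}': '{best}' (p = {n / total:.2f})")
```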

The size of the data-sets we can have with big data allows us to be more forgiving in terms of inaccuracies in the data; having such a large proportion of the available data minimizes the effect of any inaccuracies.

Vast sets of messier data can be more useful than smaller, more accurate ones.

Big Data Key Idea #4: Big data does not tell us why two things are related, just that they are, but even this is often good enough.

When buying a used car, what criteria do you look for to ensure you don’t end up buying a clunker? You might consider things such as age, mileage, country of origin, and the make and model – these criteria all seem pretty logical. But would you take the color of the car’s paintwork into account?

In 2012, contestants in a data-analysis competition were given a similar task, and their correlation analysis revealed a surprising finding: cars painted orange were half as likely to have defects as the average car.

No doubt you are asking yourself why this would be so. Being curious about the reasons behind a relationship and developing theories to explain it is human nature. But one of the implications of big data is that we don’t need to develop our own theories about cause and effect and then test them out. Automatic analyses of all the data can deliver correlations we never even thought of looking for.

In the used-car example, the causes behind the relationship may remain invisible to us, but allowing the data to speak for itself can at least deliver correlations. And finding such correlations can already be put to practical use.
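Here is a sketch of that kind of theory-free correlation hunt, on invented inspection records: group cars by paint color, compare each group’s defect rate with the overall rate, and let any outlier surface on its own.

```python
import random

random.seed(7)
colors = ["white", "black", "silver", "orange"]

# Invented inspection records: (color, had_defect). The defect rates
# are made up purely to illustrate the analysis.
cars = [
    (color, random.random() < (0.05 if color == "orange" else 0.10))
    for color in random.choices(colors, weights=[40, 30, 25, 5], k=20_000)
]

overall = sum(defect for _, defect in cars) / len(cars)
print(f"overall defect rate: {overall:.1%}")

# Compare each color's defect rate with the overall rate - no theory
# about *why* a gap exists is needed to act on it.
for color in colors:
    group = [defect for c, defect in cars if c == color]
    rate = sum(group) / len(group)
    print(f"{color:7s}: {rate:.1%} over {len(group):5d} cars")
```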

Consider research by IBM and the University of Ontario Institute of Technology to help doctors make better decisions when caring for premature babies. By analyzing data on babies’ vital signs, they hoped to identify subtle changes that could signal the onset of infection even before symptoms become visible. Counter-intuitively, the study revealed that babies’ vital signs became very stable prior to a serious infection – a kind of calm before the storm. Before the research, doctors would have been unconcerned by stable vital signs, but armed with this finding they are now able to provide better treatment when it is most needed.
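One plausible way to operationalize that finding (a sketch only, not the actual research pipeline): monitor the rolling variability of a vital sign and flag windows where it drops below a threshold, since unusual stability is the warning sign.

```python
def rolling_std(values, window):
    """Standard deviation over each sliding window of readings."""
    out = []
    for i in range(window, len(values) + 1):
        chunk = values[i - window:i]
        mean = sum(chunk) / window
        variance = sum((x - mean) ** 2 for x in chunk) / window
        out.append(variance ** 0.5)
    return out

# Invented heart-rate readings: normal fluctuation, then an unusually
# flat stretch of the kind the study linked to oncoming infection.
heart_rate = [150, 156, 148, 158, 151, 149, 157,
              152, 152, 153, 152, 152, 153, 152]

ALERT_THRESHOLD = 1.0  # assumed value; a real system would fit this to data
for i, spread in enumerate(rolling_std(heart_rate, window=5)):
    if spread < ALERT_THRESHOLD:
        print(f"readings {i + 1}-{i + 5}: std {spread:.2f} "
              f"-> unusually stable, flag for review")
```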

Big data does not tell us why two things are related, just that they are, but even this is often good enough.

Big Data Key Idea #5: Although data is generally collected for a specific purpose, there are often secondary applications that hold even greater value.

When companies collect data, they generally have a specific purpose in mind: stores collect sales data for their financial accounting, factories monitor their output to track productivity, and websites track mouse movements over their pages to optimize their customers’ user experience. Consider, too, the interbank payment system Swift, which collects data on the billions of financial transactions it processes across the globe in order to provide accurate customer records.

But, increasingly, companies are finding secondary uses for the data they have collected that are sometimes even more valuable than the original use. For example, Swift discovered that its payment data correlates well with global economic activity. As a result, the company now offers highly accurate GDP forecasts derived from their transaction data.

People’s old Internet search terms are another great example of how data can find secondary uses. At face value, the information seems of little use after it has performed its primary function – returning search results to the user – but companies such as Experian allow clients to mine this data to learn about their potential customers’ tastes and market trends, a veritable gold mine to any retail company.

Similarly, mobile phone companies amass real-time location data from their users as part of routing calls. This data has numerous potential uses, from monitoring traffic flows to delivering personalized location-based advertising.

This trend has not gone unnoticed. Big-data-savvy companies and individuals, aware of the value of such data, are already designing products and systems to capitalize on the potential secondary uses of the data they and others collect. 

Although data is generally collected for a specific purpose, there are often secondary applications that hold even greater value.

Big Data Key Idea #6: Anyone can spot new opportunities to create value from the data around them – you just need the right mindset.

Owning vast amounts of data is not much use if you don’t know what to do with it. Equally, having the skills and tools to analyze data is of little use if you don’t own any data or don’t know where to get it.

Nevertheless, there are people who have neither of these things, yet still manage to find a niche for themselves in the big-data world.

The key to these individuals’ success is having a big-data mindset: an ability to recognize where available data can be mined for information of value to many people. Although people with such a mindset may not necessarily have data or data analysis skills, they are adept at spotting opportunities and capitalizing on them before others do.

One such individual is Bradford Cross. In his mid-twenties, he started the website FlightCaster with a group of friends. They combined publicly available data on flight times and historical weather records in order to predict delays in flights across the US. Their predictions became so accurate that even airline employees began referring to their site to check on their own scheduled flights.

Decide.com is another company built on these principles. Their computer system records twenty-five billion price quotes for over four million products from e-commerce sites across the web. By analyzing this information, they don’t just provide users with the cheapest price but also advise them on the best time to buy a product, predicting if and when prices are likely to rise or fall.
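A heavily simplified sketch of such buy-or-wait advice (invented quotes; Decide.com’s real models were far richer): fit a least-squares trend to recent prices and advise based on the slope.

```python
def trend_slope(prices):
    """Least-squares slope of price against the day index."""
    n = len(prices)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(prices) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, prices))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Invented daily quotes for one product, pooled across retailers.
quotes = [499, 495, 492, 488, 485, 481, 479]

slope = trend_slope(quotes)
advice = "wait - the price is trending down" if slope < 0 else "buy now"
print(f"trend: {slope:+.2f} per day -> {advice}")
```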

It’s clear that as economies are starting to form around data, more and more people are beginning to recognize the potential value of data and attempt to extract it. Individuals and companies that have a big-data mindset are best placed to capitalize on this data gold rush.

Anyone can spot new opportunities to create value from the data around them – you just need the right mindset.

Big Data Key Idea #7: Combining sets of data can create greater value than the individual parts.

As anyone who has ever played the board game Clue (known as Cluedo outside North America) will know, pieces of information may have little value alone, but combined with others they can tell you much more. The same is true of data-sets: sometimes their value becomes apparent only when they are combined with other data-sets. Trends can then be found in the newly combined data that were not discoverable from the individual data-sets alone.

For example, in 2011 a Danish research group demonstrated this phenomenon. In one of the largest studies of its kind, they combined mobile phone user data with cancer patient records. This meant that they were able to check not only for a link between mobile phone use and cancer, but also whether greater mobile phone use increased the risk.

Critically, they used not merely a sample of the data but records of almost all cancer cases in the country, which allowed them to control for factors such as education and income without the data becoming unreliable. Despite the comprehensive nature of the study, the published results did not receive much media attention because no evidence of a link was found.

Although the above example involves combining different kinds of data, similar effects can be achieved by combining multiple sets of the same kind of data, which then provide greater value in the aggregate.

Inrix, a Seattle-based traffic-analysis company, operates on this principle. They gather real-time location data from car manufacturers, commercial fleets and their own smartphone app. Piecemeal, this information is of little use to the original data holders, but by combining it, Inrix can sell its users timely data on traffic flows and jams.
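The aggregation step can be sketched simply (probe reports invented here): a single GPS ping says little, but averaging the speeds reported on each road segment yields a live picture of traffic.

```python
from collections import defaultdict

# Invented probe reports (road segment, speed in km/h), pooled from
# car makers, commercial fleets and smartphone apps.
reports = [
    ("I-5 north, mile 164", 95), ("I-5 north, mile 164", 88),
    ("I-5 north, mile 164", 92), ("I-90 east, mile 2", 15),
    ("I-90 east, mile 2", 12),   ("I-90 east, mile 2", 18),
]

speeds = defaultdict(list)
for segment, kmh in reports:
    speeds[segment].append(kmh)

# One ping is nearly worthless; the average across many probes is not.
for segment, observed in speeds.items():
    avg = sum(observed) / len(observed)
    status = "congested" if avg < 40 else "flowing"
    print(f"{segment}: {avg:.0f} km/h ({status}, {len(observed)} probes)")
```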

Combining sets of data can create greater value than the individual parts.

Big Data Key Idea #8: Online services such as Facebook record everything we do on their sites and use this data to enhance the service they offer.

Most businesses use some form of customer feedback to improve the products and services they deliver. Traditionally though, it has been both time-consuming and difficult to collect feedback in large enough volumes to be meaningful.

In the age of big data and the Internet, information can be collected instantly and with much less effort than before, often even completely passively. Smart companies are already tracking everything we do online, including where we move the mouse and how long we hover over items. This information is referred to as data exhaust and is used to optimize the fine details of products, such as the size and placement of buttons.

Google is the undisputed leader in recycling data exhaust: users’ search queries and even their typos have been used to create a spell-checker and an autocomplete system, both of which are used across all Google services.

The more we interact with a website, the greater the trail of data exhaust we provide. Facebook used their rich source of data exhaust to discover that users were more likely to post content or reply to posts if they had just seen a friend do so. The layout was then amended to make friends’ interactions more visible.

Online gaming is also an area where this data trail can be pressed into service. Zynga’s online games are refined depending on how users play them: if a lot of players give up at a certain point in the game, Zynga adjusts the game to improve the players’ experience.
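A sketch of what such drop-off analysis might look like, on an invented event log: compute the share of players who continue past each level, and flag the steepest drop as the place to tune.

```python
from collections import Counter

# Invented log: the highest level each player reached before quitting.
last_level = [3, 3, 3, 5, 2, 3, 5, 3, 4, 3, 5, 1, 3, 5, 3]

# How many players reached each level at all.
reached = Counter()
for lvl in last_level:
    for level in range(1, lvl + 1):
        reached[level] += 1

# Share of players who continue past each level; a sharp drop marks
# the point in the game that needs redesigning.
for level in range(1, max(last_level)):
    continuing = reached[level + 1] / reached[level]
    note = "  <- many quit here, tune this level" if continuing < 0.8 else ""
    print(f"level {level}: {continuing:.0%} continue{note}")
```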

These examples show that companies that have grasped the art of recycling user data, and built it into their systems, can enhance the services they deliver.

Online services such as Facebook record everything we do on their sites and use this data to enhance the service they offer.

Big Data Key Idea #9: Current privacy laws and anonymization methods are ineffective and inefficient when applied to big data.

It’s hard to spend any amount of time online these days without being presented with a lengthy user agreement at some point. But, be honest, do you actually read through them before agreeing to the terms?

Current privacy laws require that we are informed about what information is being collected and for what purpose, and that we then give consent, which is why we are bombarded with such requests. If the company then wants to share the data it collects, it uses anonymization – the stripping out of any personal details to preserve the privacy of the individuals – before publishing the data.

Although these methods have worked up to a point, the acceleration in the collection and use of data has meant that they are rapidly becoming obsolete.

First of all, the privacy laws prevent companies from realizing secondary uses for data. Imagine that your company has collected user data and later discovers a new and valuable use for it. Under the current system, your company would need to seek approval from every user before adopting the data for this new purpose. While the intent of the legislation is sensible, its application in a big-data world may greatly hinder the benefits that could be realized.

Second, the greater detail of big data allows users to be re-identified from anonymized data, potentially revealing sensitive information in the process. For example, in 2006, AOL released a mountain of old, anonymized search terms in the hope that researchers could find interesting insights in the data. Within days, the New York Times had successfully identified one of the users as Thelma Arnold, a 62-year-old widow from Lilburn, Georgia.
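The mechanics of re-identification are easy to sketch (records invented here): strip the names, and combinations of innocuous attributes still often point to exactly one person.

```python
from collections import Counter

# An invented "anonymized" release: names stripped, quasi-identifiers kept.
records = [
    {"zip": "30047", "age": 62, "gender": "F"},
    {"zip": "30047", "age": 34, "gender": "M"},
    {"zip": "30047", "age": 34, "gender": "M"},
    {"zip": "10001", "age": 29, "gender": "F"},
]

# Count how many records share each combination of quasi-identifiers.
combos = Counter((r["zip"], r["age"], r["gender"]) for r in records)

# A unique combination can be matched against outside knowledge (a phone
# book, a voter roll) to put a name back on the record.
for combo, count in combos.items():
    if count == 1:
        print(f"{combo}: unique -> re-identifiable")
    else:
        print(f"{combo}: shared by {count} records, safer")
```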

The current tools, either legal or technical, are already proving to be ineffective, and as we move further down the big-data road they may become obsolete. More suitable alternatives need to be considered.

Current privacy laws and anonymization methods are ineffective and inefficient when applied to big data.

Big Data Key Idea #10: Big data facilitates the prediction of criminal behavior, but we must never judge someone before they have actually committed a crime.

The movie Minority Report depicts a society where predictions have become so accurate that the police arrest the would-be criminal before he or she has a chance to commit the crime. People are imprisoned not for what they have done but for what they are foreseen to do.

Although the movie is science fiction, predictions of human behavior are already used to guide certain decisions in society. For example, parole boards in more than half of all US states use data-analysis-based predictions of a prisoner’s chance of re-offending when deciding whether to grant release.

Police departments in the United States are increasingly turning to “predictive policing” in order to allocate scarce resources. They use profiling – selecting individuals, groups and neighborhoods for additional scrutiny – based on characteristics seen as predictors of crime, for example, poverty, unemployment and drug usage. Similar profiling measures are employed heavily in national security.

Yet, if misused, such methods can lead to problems of discrimination and “guilt by association.” How would you feel about being arrested on suspicion of terrorism based purely on your ethnicity, acquaintances and background?

While the additional level of detail available through big data may allow us to minimize these problems by targeting individuals rather than groups, this profiling trend is dangerous. Following this trend to its natural conclusion leads us to a world where we deny people their free will – where suspects are apprehended, patients are denied treatment or employees are dismissed – because of what they are predicted to do, not what they have done.

We have already taken tentative steps down the road of using predictions to inform decisions in the realm of law and order. If we take this trend to its extreme, we deny individuals the possibility of moral choice, something we need to guard against.

Big data facilitates the prediction of criminal behavior, but we must never judge someone before they have actually committed a crime.

Big Data Key Idea #11: Being overly data-driven can be perilous: we may be measuring the wrong thing, incentivizing the wrong behavior or relying on inaccurate data.

As our ability to collect and analyze data has developed, we have increasingly tried to use data to improve many aspects of life. However, this ability does come with certain potential pitfalls.

First of all, quantifying life can lead us to measure something that does not really capture the information we intended it to. Consider the introduction of standardized tests in education. Do a student’s standardized test scores truly reflect the range of qualities we expect education to provide?

Second, misuse of data can lead us to incentivize behavior we never intended to. Standardized tests also demonstrate this effect, as their importance has made teachers and students focus on improving test scores and not on the overall quality of education.

Finally, being overly data-driven can be problematic because we run the risk of allowing data that is biased or unreliable to shape our actions.

Consider the experience of Robert McNamara, who became the United States’ Secretary of Defense during the escalation of the Vietnam War. He became completely fixated on measuring the enemy body count as an indicator of progress, and shaped the military’s strategy around it, a decision that would later come to haunt him.

Data can be hard to verify in the chaotic conditions of war, and it later became clear that officers had reported unreliable figures. Ironically, they had done so to impress superiors such as McNamara.

With the wealth of detail and insight that big data offers, there is a risk that we could lose perspective and become so fixated on data that we fail to acknowledge its limitations or verify its quality, allowing the data to govern us in ways that create more harm than good.

Being overly data-driven can be perilous: we may be measuring the wrong thing, incentivizing the wrong behavior or relying on inaccurate data.

Final Summary

The key message of this book is:

The current use of data on such a large scale is fundamentally different from previous uses, and we need to adjust how we think about big data because of these differences. The wealth of data being collected, shared and combined can create value, enhancements and even new products or services for individuals and companies that have grasped these concepts. But we also need to guard against the misuse of big data, which could result in us losing perspective, becoming overly data-driven, or controlling and punishing people based on the results that broad analysis provides.

An actionable idea from this book:

Think creatively to extract the hidden value from the data around you

Today, anyone can create value from big data; you just have to find the right data and uses. Start by considering what data you have access to and what data is freely available, particularly online. Try to think of uses for the data that differ from the reason it was originally collected, and of how it could serve different groups or organizations when combined with other data. Finally, think about the data from the point of view of different industries or businesses, and about how they could benefit from it. In doing so, you may stumble upon an idea for a new service or product that turns the data around you into an information gold mine.