Has How to Lie with Statistics by Darrell Huff been sitting on your reading list? Pick up the key ideas in the book with this quick summary.
As H. G. Wells once said, “Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
Unfortunately, the strong understanding of statistics that “statistical thinking” requires has yet to take root in society at large, making How to Lie with Statistics all the more relevant for today’s readers.
In today’s world, our daily lives are saturated with statistical language: your OJ contains “26 percent more juice”; “four out of five doctors agree”; your toothpaste “kills 23 percent more germs,” and the list goes on.
On the surface, these all seem like meaningful claims that offer you more information about the products and services you’re considering buying. But, as this book summary will show you, the truth is far more complicated because people can use statistics to their own advantage.
In this summary of How to Lie with Statistics by Darrell Huff, you’ll find out:
- why it may be safer to drive in bad weather,
- why one magazine failed at predicting the 1936 presidential elections, and
- why it’s so hard to count all the beans.
How to Lie with Statistics Key Idea #1: It’s extremely difficult to achieve a truly random sample.
Let’s imagine that you want to figure out how many red beans are in the bean barrels of Bob’s Bean Production Plant. The only way to know for certain is to dump out every single barrel of beans and count them all individually. But doing that is not only time consuming: it’s also incredibly expensive.
Luckily, using statistics, there’s an easier way.
In order to make statistical estimates, you need to create a sample, i.e., a carefully chosen data set used to represent the whole of whatever it is you want to analyze. And since sampling is the basis for drawing conclusions in statistics, creating a good sample is absolutely crucial.
But for a sample to be “good,” it must possess two qualities: it must be large enough to be statistically significant and it must be random.
We’ll address sample size a little later and focus first on randomness – because the only kind of sample that gives true statistical data is one that is purely random.
For example, if you’re interviewing 25-year-old women about how often they play guitar, you would have to randomly select 25-year-old women regardless of their income, social class or anything else.
But getting a truly random sample is easier said than done. Laborious and expensive as it is to “count every bean” by hand, finding creative ways to reach a truly randomized sample is also extremely difficult.
Looking back to our barrels of Bob’s Beans, a good sample is easy to find if the beans are randomly mixed. You just pull out a handful and you have your sample. But what if the barrel wasn’t mixed and you take a handful from the top where only white beans are?
If you’d based your sample on that and concluded that the barrel is full of white beans, you’d have fallen victim to sample bias. In the same way, non-randomized samples can bias an experiment or study.
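The bean-barrel problem can be sketched in a few lines of Python. This is only an illustration: the barrel size, the 30 percent share of red beans, and the handful of 100 beans are all made-up assumptions, not figures from the book.

```python
import random

# Hypothetical unmixed barrel: 7,000 white beans sitting on top of 3,000 red beans.
barrel = ["white"] * 7000 + ["red"] * 3000


def red_share(beans):
    """Return the fraction of red beans in a collection of beans."""
    return sum(bean == "red" for bean in beans) / len(beans)


# Biased sample: a handful scooped from the top of the unmixed barrel.
top_handful = barrel[:100]

# Random sample: every bean has the same chance of being picked.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
random_handful = rng.sample(barrel, 100)

true_share = red_share(barrel)
top_share = red_share(top_handful)
random_share = red_share(random_handful)

print(f"true: {true_share:.2f}, top: {top_share:.2f}, random: {random_share:.2f}")
```

The handful from the top contains no red beans at all, while the random handful should land somewhere near the true share, though rarely exactly on it.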
How to Lie with Statistics Key Idea #2: Using non-random samples can lead to sample bias.
Knowing what we know now, how can we avoid biased samples without having to count “all the beans”?
One way is to employ a strategy called stratified random sampling:
First, divide your universe, i.e., the specific group of people you’ll be studying, into subgroups in proportion to their commonness. Pinpointing that proportion, however, is extremely difficult.
For example, if your universe is composed of vegetarians, how can you know what proportion of them belong to a particular race, gender or age group? Without an immense databank at your disposal, it’s difficult to know.
Second, get a random sample within each subgroup. Have a list of possible interviewees from each subgroup, and then randomly interview people from your list of, for example, black vegetarians, vegetarians under 18, and so on.
As weird as it might sound, it’s actually quite difficult to keep your sample truly random. For instance: How will you make contact with a random group of vegetarians under 18?
Will you use email? Well, not all people under 18 check or have access to email. Will you call them? They don’t all have phones, either. Nevertheless, if you can’t achieve a random sample, your study will suffer from sample bias.
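The two steps of stratified random sampling can be sketched as follows. The age groups, their sizes, and the function name `stratified_sample` are hypothetical; the sketch also assumes you already have a complete list of people in each subgroup, which, as noted above, is the hard part in practice.

```python
import random


def stratified_sample(strata, sample_size, rng):
    """Draw a random sample from each subgroup, sized in proportion to the subgroup.

    `strata` maps a subgroup name to the list of its members.
    """
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Step 1: give each subgroup a quota proportional to its share of the universe.
        quota = round(sample_size * len(members) / total)
        # Step 2: pick that many members at random within the subgroup.
        sample[name] = rng.sample(members, quota)
    return sample


# Hypothetical universe of 1,000 vegetarians split by age group.
universe = {
    "under_18": [f"under_18_{i}" for i in range(100)],
    "18_to_40": [f"18_to_40_{i}" for i in range(600)],
    "over_40": [f"over_40_{i}" for i in range(300)],
}

sample = stratified_sample(universe, 50, random.Random(42))
quotas = {name: len(chosen) for name, chosen in sample.items()}
print(quotas)
```

With these made-up group sizes the quotas divide evenly; with awkward proportions, simple rounding can leave the total one or two people off the target, which a real survey design would have to handle.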
Literary Digest learned this the hard way when they tried to predict the outcome of the 1936 presidential election by polling their readers, who had correctly predicted the past four elections. Readers said that Alf Landon would win, but it was FDR who ended up winning by a landslide.
So why did the poll fail this time but not others? The answer lies in non-random sampling: the predictions were biased because the people polled were drawn largely from lists of telephone owners and magazine subscribers.
This was important because, at the time, those who could afford Literary Digest and a telephone were mostly Republican voters, thus skewing their results.
In order to produce a better survey, Literary Digest should have proportionally divided the subgroups within their readership, produced a random sample within each, and then used diverse methods to reach them.
How to Lie with Statistics Key Idea #3: All averages are not created equal.
Imagine you’re looking to buy a new house and you run across a real estate agent eager to move you into his neighborhood. Hoping to entice you, he tells you that the average income in that neighborhood is $100,000 per year. A year later, he comes back to you saying that the average income is only $20,000 per year, even though wages stayed the same and nobody moved in or out of the neighborhood. So what happened? Is he a liar?
Not precisely: he just used statistics to his advantage. His trick was to use different kinds of “averages” to change your perception.
There are, in fact, three different types of averages: the mean, the median and the mode, and each is distinct.
The mean, also known as the “arithmetic average,” is the one he used to come up with the “average” of $100,000. The mean is found by adding up all your variables and then dividing the total by the number of variables.
In this case, our real estate agent added up the incomes of each family in the area and then divided them by the number of families to arrive at $100,000.
So where did the $20,000 come from? This number was derived from the median, which describes the middle point in your sample. For example, the median of 1, 2, 6, 12 and 23 is 6, because half of the values are above 6 and half are below.
Looking back at our neighborhood, if the median is $20,000, then half of the families have an income below $20,000 and half have an income above it.
The agent could have also used the mode, which describes the most common income in the area. If most families earn a yearly income of, for example, $22,000, then $22,000 would be the mode.
As you can see, the word “average” isn’t as straightforward as we might think, so it’s always worth asking exactly which “average” people are talking about.
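To see how one neighborhood can produce all three figures at once, here is a small sketch using Python’s standard `statistics` module. The eleven incomes are invented so that the numbers match the ones in the example; they are not taken from the book.

```python
import statistics

# Hypothetical yearly incomes for eleven families: ten modest earners
# and one family earning $900,000.
incomes = [
    15_000, 16_000, 17_000, 18_000, 19_000, 20_000,
    22_000, 22_000, 22_000, 29_000, 900_000,
]

mean_income = statistics.mean(incomes)      # total income divided by number of families
median_income = statistics.median(incomes)  # the middle value once incomes are sorted
mode_income = statistics.mode(incomes)      # the most frequently occurring income

print(f"mean: {mean_income}, median: {median_income}, mode: {mode_income}")
```

A single very high income is enough to pull the mean up to $100,000, while the median and the mode stay close to what a typical family actually earns.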
How to Lie with Statistics Key Idea #4: Be aware that marketers can use chance to skew their results.
In the first key idea, we learned that a sample has to be large enough to be considered good statistical evidence.
A sample that is too small is not “statistically significant.” To get an idea of why that is, ask yourself: what is the probability of getting heads in a coin toss? 50 percent, right? Now, find a coin and toss it ten times. How often did it come up heads? Exactly five times? It’s possible, but far from guaranteed.
So why didn’t it come up heads 50 percent of the time as predicted? Well, because the experiment wasn’t repeated often enough; it became biased by chance. The more often we repeat the coin toss, the closer we get to the “real” probability of 50 percent.
Therefore, reliable studies use a statistically significant sample in order to ensure that the outcome of the experiment or study is not biased by chance, like our coin toss.
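The coin-toss argument can be checked with exact binomial arithmetic rather than an actual coin. The sketch below assumes a fair coin; the function names and the choice of “within ten percentage points of 50 percent” as the yardstick are illustrative assumptions.

```python
from math import comb


def prob_heads(n_tosses, n_heads):
    """Exact probability of getting exactly `n_heads` heads in `n_tosses` fair tosses."""
    return comb(n_tosses, n_heads) / 2 ** n_tosses


def prob_share_near_half(n_tosses):
    """Probability that the share of heads lands between 40 and 60 percent."""
    # |k/n - 0.5| <= 0.1 is rewritten as 5 * |2k - n| <= n to stay in whole numbers.
    return sum(
        prob_heads(n_tosses, k)
        for k in range(n_tosses + 1)
        if 5 * abs(2 * k - n_tosses) <= n_tosses
    )


exactly_five_of_ten = prob_heads(10, 5)
near_half = {n: prob_share_near_half(n) for n in (10, 100, 1000)}

print(f"exactly 5 heads in 10 tosses: {exactly_five_of_ten:.3f}")
for n, p in near_half.items():
    print(f"{n} tosses: share of heads within 40-60% with probability {p:.4f}")
```

Exactly five heads in ten tosses happens only about one time in four, and the chance of landing near 50 percent rises steadily as the number of tosses grows.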
However, when studies fail to use statistically significant samples, chance alone can produce sensational results. For those who want to impress us with new products or services, this can be a useful tool.
For example, you’ve probably heard things like “Users report 23 percent fewer cavities with XYZ’s toothpaste” before. But are these claims trustworthy? Even if they insist that the study was conducted by an independent third party, there is reason to be skeptical.
Rather than create a controlled, statistically significant experiment, it’s very possible that they’ve simply used a small sample to get a good headline.
Think about it: if you try out a new toothpaste to test its effectiveness against cavities, what are all the possible outcomes? Fewer cavities, more cavities or the same amount of cavities.
If, after a period of time, the people in a study’s sample haven’t actually seen any improvement, then the researchers can simply discard the results and repeat the experiment with another small group until one shows fewer cavities by chance, thus rigging the experiment for the desired result.
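How easy is it to get a flattering result by repetition alone? The sketch below works this out under made-up assumptions: the toothpaste does nothing, each user is equally likely to report fewer cavities or not, each study has only six users, and a study counts as “impressive” when at least five of the six report fewer cavities.

```python
from math import comb


def prob_impressive(n_users, min_successes, p=0.5):
    """Chance that at least `min_successes` of `n_users` report fewer cavities
    when each does so with probability `p` purely by chance."""
    return sum(
        comb(n_users, k) * p ** k * (1 - p) ** (n_users - k)
        for k in range(min_successes, n_users + 1)
    )


def prob_any_impressive(p_single, n_studies):
    """Chance that at least one of `n_studies` independent studies looks impressive."""
    return 1 - (1 - p_single) ** n_studies


p_single = prob_impressive(6, 5)                  # one small study
p_after_20 = prob_any_impressive(p_single, 20)    # keep trying up to 20 times

print(f"one study: {p_single:.3f}, at least one in 20 studies: {p_after_20:.3f}")
```

Under these assumptions a single tiny study looks impressive about one time in nine, but someone willing to run twenty of them and report only the best one will succeed roughly nine times out of ten.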
How to Lie with Statistics Key Idea #5: Beware of the missing standard error.
Have you ever heard something that was just too good to be true? The same thing can happen when we’re talking about statistics. Luckily, in statistics, you have an easy way to find out whether someone is trying to dupe you: just find the standard error.
No measurement can be perfectly accurate because, as we’ve seen, it’s extremely difficult to “count all the beans” or create a perfect sample. We have to take into account the standard error, or the average imprecision of how our data was measured.
In order to understand the nature of the standard error, imagine you’re taking an IQ test, where 100 is the accepted average IQ for human beings. Let’s assume you scored a perfectly average 100 on your first try, 110 on the second and 90 on the third try.
Knowing that your score deviates each time you take the test, you’ll need an estimate of the error to gain the best understanding of your IQ. But how?
Start with the average IQ. In this case, 100. Then measure how far each result deviates from that average. For the first try, the deviation is 0 (100 to 100); for the second (110 to 100) and the third (90 to 100), it’s 10.
You then divide the total sum of deviations (0+10+10=20) by the number of results (here, three) and voila: you have a rough error estimate of 6.67 (20/3=6.67). Strictly speaking, this figure is the mean absolute deviation; the formal standard error is calculated from the standard deviation and the number of measurements, but the simpler figure conveys the same idea.
But how is this helpful? We now know that scores ranging from roughly 93 to 107 are all consistent with your IQ. In other words: your IQ is 100 +/- 7.
As we learned earlier regarding statistical significance, this number becomes more precise the more often you test your IQ, and the same applies to anything that you want to accurately test using statistical methods, such as the national or even world “average” IQ for human beings.
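The IQ arithmetic above can be reproduced in a few lines, alongside the textbook formula for comparison. The simplified figure of 6.67 is what statisticians would call a mean absolute deviation; the standard error of the mean is conventionally the sample standard deviation divided by the square root of the number of measurements. The variable names are illustrative.

```python
import math
import statistics

scores = [100, 110, 90]  # the three test results from the example

mean_score = statistics.mean(scores)

# Simplified calculation from the text: average distance of each score from the mean.
mean_abs_deviation = sum(abs(score - mean_score) for score in scores) / len(scores)

# Textbook calculation: sample standard deviation divided by sqrt(number of scores).
sample_sd = statistics.stdev(scores)
standard_error = sample_sd / math.sqrt(len(scores))

print(f"mean absolute deviation: {mean_abs_deviation:.2f}")
print(f"standard error of the mean: {standard_error:.2f}")
```

The two figures differ somewhat, but they tell the same story: three test results are not enough to pin an IQ down to a single number.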
How to Lie with Statistics Key Idea #6: Beware of arbitrary comparisons.
What do you do if you want to convince someone of something, but you just don’t have the proof? One way is to simply demonstrate something else as being true and then pretend it’s the same thing.
This method, which statisticians sometimes call the semi-attached figure, is among the most common deceptions in statistics.
Creating a semi-attached figure is fairly easy: you pick two or more things that sound the same – but explicitly aren’t – and draw a comparison between them. You can think of this as jumping over a gap in your argument in order to achieve a desired conclusion.
Here’s a good example: if you want to sell a cold medicine, but can’t actually prove that it works, then simply publish your laboratory report demonstrating that half an ounce of your medicine killed 40,523 germs in a test tube in under seven seconds. Now all you need is a photo of a handsome doctor, and your advertisement is ready to go!
Of course, your study did not demonstrate that the medicine actually works in the human body, or even that it wouldn’t kill your patients. However, to the untrained ear, your study might seem quite convincing!
Or consider this: your local newspaper claims that there are four times as many fatalities on the highways at 7 p.m. as at 7 a.m. The implication is that it’s more dangerous to drive at 7 p.m., but is this true? Not necessarily. There are simply more people on the highways to be killed.
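The highway example comes down to comparing raw counts with rates. In the sketch below, the fatality and traffic figures are entirely hypothetical, chosen only so that the evening has four times the fatalities of the morning.

```python
# Hypothetical figures: fatalities and number of vehicles on the road in each hour.
traffic = {
    "7 a.m.": {"fatalities": 10, "vehicles": 200_000},
    "7 p.m.": {"fatalities": 40, "vehicles": 1_000_000},
}


def fatalities_per_100k(record):
    """Fatalities per 100,000 vehicles on the road."""
    return 100_000 * record["fatalities"] / record["vehicles"]


rates = {hour: fatalities_per_100k(record) for hour, record in traffic.items()}

for hour, rate in rates.items():
    print(f"{hour}: {traffic[hour]['fatalities']} fatalities, {rate:.1f} per 100,000 vehicles")
```

With these invented numbers, the evening hour has four times as many fatalities yet a lower risk per vehicle, because five times as many vehicles are on the road.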
Another great way to deceive people with comparisons is to use percentages. All you have to do is “forget” to mention what you are actually comparing.
For example, if you drink a glass of OJ with your breakfast that contains “26 percent more juice,” you might feel like you’re making a healthy choice. But what, exactly, is it 26 percent more of?
Or if you use shampoo that “makes hair up to 60 percent shinier,” shinier than what? Are we comparing it to when you use rocks? Really, it could be anything.
How to Lie with Statistics Key Idea #7: Don’t jump to conclusions.
Not all statistical errors hide a malicious intent. Some errors result from simple misunderstandings, such as the post-hoc fallacy: the assumption that two things are causally related simply because they occur together or one after the other.
However, just because two things happen together doesn’t mean that one causes the other. In fact, when A and B happen together, we can’t necessarily know whether A causes B or whether it was the other way around.
It could even be the case that both A and B are actually the product of some unknown factor, C!
Nevertheless, in statistics, we often look for correlations, i.e., the degree to which things show a tendency to vary together, in order to explain the world around us.
But these correlations can be demonstrably wacky. For example, you might notice that, recently, the ozone layer got thinner as the number of gay marriages increased. Does it make any sense to assume that one caused the other?
To make matters even more complicated, there are several different types of correlations.
One is entirely produced by chance: if you repeat an experiment enough times and with a small enough sample, you’re bound to produce the spectacular correlation of your choice eventually (remember the toothpaste example from earlier).
Another type of correlation fallacy is covariation. Here, the relationship between two variables is real and demonstrable, but the direction (whether A causes B, B causes A, or A and B both influence each other) is either unclear or impossible to determine.
For example, we know that wealth and stock ownership are related. But do people with more wealth buy more stocks? Or does buying stocks make you wealthy? Which influences which, and to what degree?
There are other types of correlation-causality fallacies as well, but they all come down to one simple fact: correlation may point toward causality, but it is insufficient on its own to prove it.
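The first type, correlation produced purely by chance, is easy to simulate. The sketch below generates pairs of completely unrelated random series and keeps the strongest correlation it finds. The sample sizes, the number of attempts, and the function names are illustrative assumptions.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5


def strongest_chance_correlation(n_points, n_attempts, rng):
    """Repeat an experiment on unrelated random data and keep the strongest correlation."""
    strongest = 0.0
    for _ in range(n_attempts):
        xs = [rng.random() for _ in range(n_points)]
        ys = [rng.random() for _ in range(n_points)]  # generated independently of xs
        strongest = max(strongest, abs(pearson_r(xs, ys)))
    return strongest


rng = random.Random(7)  # fixed seed so the sketch is repeatable
small_sample_best = strongest_chance_correlation(5, 1000, rng)
large_sample_best = strongest_chance_correlation(500, 1000, rng)

print(f"best of 1,000 attempts with 5 data points: {small_sample_best:.2f}")
print(f"best of 1,000 attempts with 500 data points: {large_sample_best:.2f}")
```

With only five data points per attempt, a near-perfect correlation turns up sooner or later even though the two series have nothing to do with each other; with 500 data points, the strongest chance correlation stays weak.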
How to Lie with Statistics Key Idea #8: Be doubtful and aware.
By now, you should understand the basics of how to lie with statistics. But these “lies” are not always born of ill motives, and many statistical errors can be caused unintentionally by simple incompetence.
That said, it’s still worth noting that most of these errors inflate and sensationalize statistics and their meanings rather than deflate and level them. So what can you do to defend yourself against bad statistics?
First and foremost, ask yourself the right questions. Consider who conducted the study and what their motives might be.
For example, studies conducted or sponsored by companies should always be scrutinized carefully because the companies are strongly motivated to produce results that favor them somehow.
Second, you should be suspicious of both stated and unstated data. Be on the lookout for small or poorly selected samples, as we know they produce biased results.
What about any reported correlations? Are their samples big enough to make them significant? Remember the coin-toss example and use your common sense: Does the study involve enough carefully selected participants? Did they pass over any important groups?
Do the authors give you the standard error? Do they specify the types of average they’re using?
If not, then something is amiss. The best case is that they simply forgot to mention this valuable information. Otherwise, you should be aware that someone might be trying to tinker with the results.
Finally, watch out for a sudden change of subject. This could be something as simple as jumping from raw data to a conclusion without connecting the dots. Ask yourself: Do these numbers actually lead to their conclusion? Or is the person creating false causality?
Unfortunately, it’s not likely that marketers, businesses and public relations experts are going to suddenly adopt a scientifically accurate, honest and ethical way of presenting information to the public. So it’s up to you to stay alert and ask the right questions if you want to keep yourself from getting duped.
In Review: How to Lie with Statistics Book Summary
The key message in this book:
Statistics is a powerful and complex tool that can easily be misused or misinterpreted to produce biased results. In order to keep yourself from getting duped by statistics, you’ll have to keep a keen eye out for the tricks used to deceive us.
Actionable advice:
If there’s no sample size given, be suspicious!
Samples are only useful if they’re statistically significant. If they aren’t, then the experimenters have free rein to manipulate the results as they wish by changing the parameters of their experiment until they’ve reached the desired results, thus allowing them to make sensational claims under the guise of scientific language.