Is New York the New Italy?

First things first. If you start counting days after the 100th confirmed case of COVID-19, the United States is indeed skyrocketing past every country in terms of confirmed cases.

However, the number of confirmed cases is simply a function of the number of tests administered and the existing prevalence of the disease. We know that we got a late start and are catching up fast, so this probably isn’t the best measuring stick.

The more important number to watch is the number of deaths.

By this metric, Spain is far and away the record-breaker, going from 10 deaths to 2,808 in only 17 days. It’s about to surpass China in total deaths. If Spain is the next Italy, the United Kingdom may be the next Spain…

The UK is actually not far behind where Spain was at 10 days after 10 deaths. Meanwhile, the United States seems pretty quiet, relatively speaking…

However, at the state level, you see a different picture…

If you count days since 10 deaths (the first data point for NY above is at 10, even though the chart says it’s counting days since 5), New York is at a whopping 210 deaths after only 8 days. Compare that to Spain’s 289 and Italy’s 107 at that point and you realize that this is very alarming. NY has less than half the population of Spain and a third of the population of Italy.

Given that Italy’s death toll has been rising 20 days longer than New York’s with no end in sight, I don’t think that this thing is going to clear up by Easter.

The Scary Index

First, the bad news. Starting the day that 100 cases of COVID-19 were confirmed, the United States has reached 35,000 confirmed cases faster than any other country, even China.

However, much of that is due to finally getting some testing done, which is a good thing. Probably a more important metric to watch is the number of deaths. Here are the top 10 countries sorted by total deaths.

Among these countries, the speed at which deaths have ramped up in the U.S. since the first one is only average (it took 18 days to go from 1 to 100 deaths and 23 days to go from 1 to 400). South Korea is given a lot of credit for their extreme testing, and rightly so: it took them 29 days to get up to 100 deaths. Surprisingly, France did even better, staying under 100 deaths for 31 days after their first.

I expected to see Italy as the country that hit 100 and 400 deaths the fastest at 12 and 17 days, respectively. However, Spain is in even bigger trouble than Italy was, taking only 10 days and 14 days to get there.

Italy is still the country with the highest number of deaths per day; however, Spain is catching up fast.

Below are the cumulative totals. Italy passed China long ago, and Spain is on track to become the second country to do so.

It’s even worse for Spain when you consider population sizes. It could be where Italy is in a matter of days.

If you live in the United States, you can take comfort in the fact that, at least for the time being, the cumulative number of deaths per million population is barely on the chart. If total deaths per capita is the Scary Index, Italy and Spain are the ones setting the bar.

All of these charts are available and interactive at this site.

COVID-19 “Game Changer”? Test it, but don’t bet on it.

According to a WSJ article, doctors in France, South Korea, and the U.S. are using hydroxychloroquine to treat COVID-19 “with success” and it says that we don’t have time to wait for confirmation. It refers to this study, stating “…researchers in France treated a small number of patients with both hydroxychloroquine and a Z-Pak, and 100% of them were cured by day six of treatment. Compare that with 57.1% of patients treated with hydroxychloroquine alone, and 12.5% of patients who received neither.” That sounds incredibly promising, and while the article does mention the potential downsides of the shortage of hydroxychloroquine and peddling false hope, it clearly recommends using the treatment now rather than waiting.

A shortage of the drug may not seem like a big downside until you realize that it’s already being used by people (including my aunt) for conditions that the drug has actually been clinically proven to treat. (Update: it looks like Israel has got my aunt covered.) As for the French study, unfortunately, if you look at it closely, it is practically a poster child for the types of research that can’t be replicated. There’s no randomized control. There’s a small number of treated patients (only twenty). It’s a hot field of study.

To see why these things matter, you may want to read one of the most downloaded papers in history from Dr. John Ioannidis titled “Why Most Published Research Findings Are False.” Its popularity is due to the fact that it addresses the replication crisis in science head-on and provides warning signs to watch out for before deciding to take new research seriously. In short, they are…

Small sample sizes. Have you ever noticed that if you rank cities by crime rate or cancer rate, that the ones at the top and the bottom of the list always have very small populations? This is because it’s much easier for small sets of data to show extreme results. In the same way you wouldn’t be convinced if someone says they’re the best shooter in basketball because they made 100% of their shots, your first response to the report that 100% of patients were cured of COVID-19 after 6 days with a combination of Z-Pak and hydroxychloroquine shouldn’t be “there’s a cure!”, it should be “wow, not many patients in that study.” If a study has a surprising result and a small sample size, be skeptical.
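To see how easily a small study can produce an extreme result, here’s a minimal simulation sketch. The 20-patient trial size mirrors the French study; the 50% “true” recovery rate is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
true_recovery_rate = 0.5     # an assumed "nothing special" recovery rate, purely for illustration
n_simulated_trials = 100_000

for n_patients in (20, 200):
    # observed recovery rate in each simulated trial
    observed = rng.binomial(n_patients, true_recovery_rate, n_simulated_trials) / n_patients
    extreme = np.mean(observed >= 0.75)
    print(f"n={n_patients}: {extreme:.2%} of simulated trials report a 75%+ recovery rate")
```

With 20 patients, a meaningless “75% cured!” result pops up orders of magnitude more often than it does with 200 patients, even though nothing real is going on in either case.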

Small effect sizes. Effect sizes tend to be exaggerated, so if there’s a small effect size being reported, there’s a decent chance the real effect size is zero. In the case of the French study, the effect size is huge, perhaps suspiciously so. Only 2 of the 16 patients in the control group recovered in 6 days, while 14 of the 20 in the treatment group did. This seems overwhelmingly convincing until you see that the control group was not randomly chosen and was extremely different from the treatment group (on average, the control patients were 14 years younger, so who knows how different they were on important features like how widespread the infection was on day one).

Also, even if we assume that the treatment and control groups were comparable, the claimed p-value of 0.001 is unlikely to be accurate. It’s calculated correctly, given the difference between control (12.5% recovering) and the treated patients (70% recovering), but ignores the fact that the treatment group had some patients removed from the analysis: “Six hydroxychloroquine-treated patients were lost in the follow-up during the survey because of early cessation of treatment. Reasons are as follows: three patients were transferred to intensive care unit…one patient stopped the treatment…because of nausea…one patient died…” Wait, what? Can we not assume that patients who went to the ICU or died would still have the virus if they were tested? Instead of just testing whether or not the drug removes the virus from patients still kicking, shouldn’t the question be whether or not the drug helps patients leave the hospital alive when all is said and done? There was also a patient counted as virus free who was found to have the virus two days later (presumably due to a false negative test). To be fair, even when I add back in the five patients I think should not have been removed from the treatment group, the p-value is still 0.01, but the filtering out of patients with clearly bad outcomes from the treatment group is not comforting.
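If you want to check the arithmetic yourself, here’s a rough sketch using Fisher’s exact test. The paper used a different test, so the p-values won’t match its figures exactly, and exactly which excluded patients get added back as treatment failures in the “adjusted” table is my assumption based on the description above:

```python
from scipy.stats import fisher_exact

# As reported: 14 of 20 analyzed treated patients virus-free at day 6, vs. 2 of 16 controls
_, p_reported = fisher_exact([[14, 6], [2, 14]])

# Adjusted (assumption): count five of the excluded treated patients as failures -> 14 of 25
_, p_adjusted = fisher_exact([[14, 11], [2, 14]])

print(f"reported comparison: p = {p_reported:.4f}")
print(f"adjusted comparison: p = {p_adjusted:.4f}")
```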

Many tested relationships. If there are many failures that are disregarded, you can be pretty sure that the successes occurred due to chance alone. Fortunately, the French study doesn’t appear to be doing this. They mention that the combination of Hydroxychloroquine and Z-Pak had a 100% success rate, but also don’t hide the fact that it was only tried on a total of six patients. The overall focus remains on the big picture of hydroxychloroquine vs. COVID-19 as originally intended.

Flexibility in analysis. This one is very similar because it has to do with trying many different ideas before coming to a conclusion. Economist Ronald Coase summed this up nicely with the saying: “If you torture data long enough, it will confess.” There does seem to be a bit of data torture going on in the French study. These were described well and enumerated in an email from a friend, Dr. Adam Chapweske, yesterday…

  1. Any patient whose illness becomes severe during the trial is excluded from final analysis. So it seems that a study designed to determine viral clearance explicitly excludes anyone who ends up having a really high viral load at some point in the study (assuming severe illness indicates high viral load). This ended up excluding 4 patients receiving hydroxychloroquine from the final analysis (in addition to two others lost to follow up for other reasons) and none from the control group.
  2. Their registered secondary endpoints include clinical outcomes, but they don’t include these in their reported endpoints. As mentioned above, several patients receiving hydroxychloroquine either died or required transfer to an ICU, whereas no patients in the control group did. This makes me wonder about the other clinical data they originally planned on reporting but ad hoc decided not to mention. It’s particularly concerning since the authors themselves make very strong clinical recommendations based on their study.
  3. Best I can tell, their decision to report on results early (i.e., prior to completing enrollment or completing their registered primary endpoint) was also ad hoc.
  4. Their registered design does not mention azithromycin, yet they report on outcomes involving that drug and even include it in the title of their paper and in their results. Given they were not actually studying azithromycin, it would have been fine to mention the effect in the discussion section as a possible intervention for future study but they shouldn’t give the impression that it was in some meaningful sense “studied”. 
  5. The primary endpoint was virological clearance, but the baseline viral load for the control group is not given so we don’t know if the two groups are comparable with respect to this parameter. This is especially important in light of the small sample size, differences in disease (upper respiratory tract infection vs lower respiratory tract infection) and demographic and geographical differences between the two groups. 
  6. Virological measurements were reported differently for the two groups as well, which suggests to me that there were differences in the way they were tested. 

Financial incentives. We all know to watch out for this one. Financial incentives are incredibly powerful and unfortunately, many people value money more than their scientific integrity. I didn’t see any reason to suspect that the researchers would benefit personally by promoting their recommended drug. They’re just reporting what they truly believe is a promising result.

And the last one: A hot field of study. If the field is hot, there is a race to publish, and with enough researchers trying different things, it’s almost certain that someone will find something somewhere that is statistically significant. This is like collective p-hacking. Rather than one researcher trying many things, you have many researchers trying one thing, and the result is the same: unreliable results. Studying the effect of drugs on COVID-19 is clearly a hot field of study. So prepare yourselves for several false positives, even from more scientifically rigorous studies than this, before we find an effective treatment. In the meantime, keep experimenting. And I’m begging you: please use a randomized control.

Update (3/23/2020): WHO announced a megatrial that will test the four most promising treatments for COVID-19. The WHO scientific panel originally wasn’t going to include the “game changer” drug, but decided to test it due to the attention it’s getting. According to Susanne Herold, an expert on pulmonary infections, “Researchers have tried this drug on virus after virus, and it never works out in humans. The dose needed is just too high.” Even though it doesn’t seem likely to work, I am happy to see that it was included in the megatrial. Now that the rumors are out there and people are scrambling for it, some people are inevitably going to find out the hard way that hydroxychloroquine might do more harm than good. It’s better to give it a rigorous test and provide people with solid answers rather than speculation.


COVID-19 Stats to Watch

Of all of the statistics and numbers out there, the chart I’m most interested in watching is this one…

This chart will make it clear when the spread of COVID-19 has ended its exponential growth in each country. When the number of new cases slows down, we can estimate what the final prevalence (spread of the disease) will be. In other words, when Italy’s curve levels out, we can see the light at the end of the tunnel.

Here’s the same chart with China included…

What I’m looking for is the point where Italy’s curve flattens out like China’s did.

Many sources are using mathematical models estimating that 50% of the population will become infected, but I think this is much too pessimistic, given the drastic measures being taken around the world to slow down the spread (please stay home!). More specifically, there are only a couple of very small countries where the prevalence has passed 1% (see the top four below)…

Notice the final column, showing the total number of cases per million population. San Marino, a tiny country within Italy (population 33,000), has the highest percentage of infected people at 4.2% (144 cases). Because of the small population size, tiny countries like this will be the most extreme, and the numbers should be taken with a big grain of salt. Hence the importance of watching Italy in the first chart. Because of its large population size and massive spread of the disease, it will give us a good indication of the final infection rate we can expect.

The reason prevalence is so important is that if you want to estimate your probability of dying from the disease, you need to multiply the prevalence by the fatality rate. A disease with a 100% fatality rate that only spreads to 0.01% of the population (0.01% die) is much less scary than a disease with a 1% fatality rate that spreads to half of the population (0.50% die).
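In code form, the back-of-the-envelope arithmetic is just a multiplication (the rates below are the hypothetical ones from the example above):

```python
# probability of dying = prevalence (chance of catching it) * fatality rate (chance of dying if caught)
def death_risk(prevalence, fatality_rate):
    return prevalence * fatality_rate

print(death_risk(0.0001, 1.00))  # 100% fatal, 0.01% prevalence -> 0.0001 (0.01% of the population dies)
print(death_risk(0.50, 0.01))    # 1% fatal, 50% prevalence     -> 0.005  (0.5% of the population dies)
```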

John Ioannidis, the most careful thinker I’ve ever talked to, recently wrote an article suggesting that the fatality rate of COVID-19 (based on admittedly thin data) is more likely in the range of 0.05% to 1%. That would be good news compared to the current higher estimates. Also good news, to me, is that the scarier estimates of the disease’s final spread are based on mathematical models rather than on what has actually happened in other countries. Mathematical models are very useful, particularly if they motivate people to stay home and stop a pandemic in its tracks. However, if you really want to estimate your chances, watch the real-world data.

Note: a reader pointed me to a very well-argued opposing view to Ioannidis. I need to reiterate that my somewhat optimistic view above is based on the assumption that the dramatic shutdowns around the world continue. I can appreciate the opposing view that “the exact numbers are irrelevant” and that we don’t want to be lulled into a “false sense of security by Ioannidis.” We should indeed continue to act in ways that avoid the worst-case scenario, because in a situation like this, the cost of being too optimistic is much higher than the cost of being too pessimistic (stocks can rebound and people can eventually find jobs again). In summary, if you’re in a position of authority, please continue closing everything until this is behind us! It is this dramatic action that makes me optimistic about the future.

Link to table here.

BTW, in case you’re looking for a good book to enjoy during the Apocalypse, there are only 13 copies left on Amazon!

Comparing NBA Players Across Decades

A couple days ago, college basketball player J. J. Culver made the news by scoring an astounding 100 points in a game. Since it was against Southwestern Adventist University and not the New York Knicks, it’s not as impressive as Wilt Chamberlain’s 100-point game in the NBA. However, players at every level have progressed a lot since 1962. Or have they?

It’s incredibly difficult to compare players across different eras, since the defense evolves along with the offense. It certainly appears in video footage that NBA players in the 1960s are light-years behind modern players. Even the championship teams of that era subjectively look like Division I college teams at best. However, there are two statistics that can be compared across decades: free throw percentage and the NBA Three-Point Contest shooting percentage. If players have gotten better over the years, there’s no reason to think that they only improved in some ways and not others, so their improvement should be statistically apparent across the board, including these two.

Well, it turns out that I’ve compiled a nice dataset with all of the scores in history from the NBA 3-Point Contest from various sources (including watching videos) and historical NBA free throw stats are readily available. Unfortunately, there are various numbers of 2-point “money balls” in the shooting contest, so I don’t have the exact shooting percentage, just the percentage of possible points scored. However, it’s reasonable to use this as a very good approximation of shooting percentage, since it’s hard to imagine why a player would shoot significantly better or worse when he knows the ball is worth two points. So let’s do this!

Starting with the less interesting stats, there is a significant improvement in league-wide free throw percentage over the years…

It’s not a big difference, but it’s there. The trendline shows a historical 0.06% improvement per year with a p-value of less than 0.0001, which means that it’s extremely unlikely that there would be a trend like this if the year the stats were collected were unrelated to shooting percentage. However, it looks like there were a few really bad years at the beginning that could be making all the difference. So let’s look at it since 1950.

The slope has definitely decreased (it’s now 0.04% per year), but it’s still statistically significant at the p=0.0001 level. Of course, it would’ve been easier to simply average the shooting percentage of the last five years in the dataset, compare it to the average for the first five years, and show that it’s improved by 6.9% since then. However, doing a linear regression like this provides a more accurate estimate of the actual improvement (73*0.0006 = 4.4% improvement since the beginning) since it considers all of the data. It also tells you whether or not the relationship is statistically significant. So you can see why linear regression is a statistician’s best friend; it’s easily interpretable and fun! (BTW, the R-squared metric is a “goodness of fit” measure that ranges from 0 to 1 (perfect fit), and this one is saying that the year explains 30% of the variance in free throw percentages. The other 70% is probably the presence of Wilt Chamberlain or Shaquille O’Neal dragging the percentage down. Joking!)
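For anyone who wants to reproduce this kind of trendline, here’s a minimal sketch using scipy’s linregress. The free throw percentages below are made-up placeholders (not the real league-wide numbers), constructed only so the output has the same flavor as the chart:

```python
import numpy as np
from scipy.stats import linregress

# placeholder data: one league-wide free throw percentage per season
years = np.arange(1947, 2020)
ft_pct = 0.71 + 0.0006 * (years - 1947) + np.random.default_rng(1).normal(0, 0.01, len(years))

result = linregress(years, ft_pct)
print(f"slope: {result.slope * 100:.3f}% per year")
print(f"p-value: {result.pvalue:.2g}")
print(f"R-squared: {result.rvalue ** 2:.2f}")
```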

Okay, now for the fun one: Are NBA 3-point shooters getting better as well? In the first years of the NBA 3-Point Shootout, there were some incredible performances from shooters like Craig Hodges (who made 19 straight shots in one round) and Larry Bird (who won the first three contests). Maybe the “splash brothers” from Golden State are outliers and long-distance shooting accuracy has generally remained stable since the 1980s.

It doesn’t look like it! Due to the small number of shots in the contest each year, the data is much noisier than the free throw percentage, but the trend is still clear: recent players are better shooters. The slope is steeper than the free throw trend, with an improvement of 0.26% per year, but because of the volatility in the data, the p-value isn’t as small (p=0.0013). Another way to think of the level of statistical significance is to say that we have provided strong evidence against the hypothesis that shooters are equally skilled across the decades. In science, you can’t really prove a hypothesis is true, you can only provide evidence that’s consistent with it or falsify it.

We can’t talk about the history of the NBA three-point contest without addressing the question: who is the best shooter in the contest’s history? If you simply sort by shooting percentage, this is what you get:

So, the highest shooting percentage in the contest history belongs to Joe Harris, with an astonishing 75% accuracy. However, this isn’t the whole story. Here is the same list, with an important piece of additional data: the total number of shots taken:

There is a statistical tendency for the tops (and bottoms) of ranked lists to be monopolized by the smallest sets of data. This is because it’s easier to achieve extreme results with only a few tries. Intuitively, this makes sense. If someone says “I’m the best shooter in basketball; I’m shooting 100% this year”, you already know that they’ve only made a couple baskets at most. In this case, the effect is not as extreme as it normally is, because if a shooter was on fire in one shooting round, they probably advanced to another round (it’s truly remarkable that Danny Green didn’t make it to the finals in 2019 after shooting 68%!). However, you do see a lot of players at the top of the list who only shot a couple rounds. So, how do we adjust for this “small sample size effect” and compare shooters with varying numbers of shots?
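Here’s a quick simulation sketch of that effect: give every shooter the same true 52% accuracy, vary only the number of shots, and the top of the percentage-sorted list still fills up with low-shot shooters. The shot counts are loosely modeled on the contest’s 25-shot rounds; everything else is made up:

```python
import numpy as np

rng = np.random.default_rng(42)
true_accuracy = 0.52                        # same skill for everyone, by construction
shots = np.repeat([50, 100, 250, 475], 25)  # 25 hypothetical shooters at each shot count

made = rng.binomial(shots, true_accuracy)
pct = made / shots

# sort by observed percentage and look at who dominates the extremes
order = np.argsort(pct)[::-1]
print("shot counts of the top 10 'shooters':", shots[order[:10]])
print("shot counts of the bottom 10:        ", shots[order[-10:]])
```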

I don’t think we can. How can you say who’s a better shooter, between Joe Harris, who made 75% of his 50 shots, or Steph Curry, who made 65% of 250 shots? The only thing I think we can do is to control for the number of shots and compare players who shot the same number of rounds. Starting with the most shots and working backwards, the winner of the “475 shots” category is…

Craig Hodges! Of course, it’s hard to call him the best shooter since there’s nobody with that many shots to compare him with, and his 56% shooting isn’t particularly noteworthy (the overall average was 52%). However, he did leave us with this unforgettable performance, so he deserves recognition for that.

Similarly, Dale Ellis and Dirk Nowitzki were the only shooters at their respective shot totals, and each shot only about 50%. However, when we get down to players with 250 shots, it gets interesting…

These shooters are no clowns. Who could have imagined that Reggie Miller would rank last on any 3-Point Shooting ranking? So, we have our first candidate for best shooter in the Three-Point Contest history. Steph Curry, with 65% accuracy, ranks best in the first contested category.

Next up is the 200-shot category.

With a 60% shooting accuracy, Kyrie Irving edges out Curry’s coach, Steve Kerr. Now, I realize that I lied earlier. We can compare across these categories if the shooter with more shots also has a higher percentage! We can unquestionably state that Curry’s 65% with 250 shots is more impressive than Irving’s 60% performance with 200 shots. So, Curry can still be considered the top shooter.

Next up: 175 Shots.

Klay Thompson takes this title with 63% shooting. Now, it’s unclear whether his higher percentage would hold up if he shot another 25 times, so we can’t clearly rank him above Kyrie Irving. However, we can say that Curry’s higher percentage on more shots is still objectively the best.

150 Shots…

We finally have a true contender for Curry. Tim Legler has technically shot a higher percentage than Curry (66% to 65%); however, if he took another 100 shots, it’s fairly likely that he would regress toward the mean a bit and fail to keep up his accuracy. Since we don’t truly know what his mean is, I don’t think there’s an objective way to judge who’s more impressive unless we got Tim out there to shoot a few more rounds.

125 shots…

In the 125-shot category, Devin Booker takes it with 62% shooting. Both Curry and Legler outperformed this with more shots, so this means we can leave Booker out of the overall “most impressive shooter” contention.

100 shots…

Marco Belinelli takes this category, with a Steph-matching 65% accuracy. However, it’s more impressive to shoot 65% over 250 shots than to do it with 100 shots, so Steph and Legler’s performances are still the best.

75 shots…

Jim Les wins this one with 56% shooting, but again falls behind Steph and Legler.

Next up is the highly-contested 50-shot category. These are all the remaining players who won at least one shootout round.

Joe Harris shot a remarkable 75%, so even though the sample size is so small, we can’t rule out that his performance gives him the title for most impressive 3-Point Contest shooter in history.

So we’ve boiled it down to three contenders. Who do you consider most impressive? Your guess is as good as mine…

Steph Curry: 65% of 250 shots

Tim Legler: 66% of 150 shots

Joe Harris: 75% of 50 shots

UPDATE: Since publishing this blog article, my co-author Gary Smith (he knows a lot of things) pointed out that there is a statistical way to compare Joe Harris to Steph Curry here. He said “there is a difference-in-proportions test for the null hypothesis that two probabilities are equal” which is another way of saying that you can attempt to falsify the hypothesis that Curry’s and Harris’s shooting percentages are the same. Here’s what I get when I plug in Curry’s and Harris’s shooting percentages…

So, this is saying that the shot data we have is not sufficient to argue that Harris’s shooting performance was significantly better than Curry’s. However, it is suggestive that he may be better; in fact, if he had made two more shots out of the 50, this test would have supported the idea that Harris’s performance was significantly better than Curry’s. Time will tell!
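For reference, here’s a minimal sketch of that difference-in-proportions test, plugging in the rounded percentages from the post (the exact made-shot counts come from the table below, so treat the output as approximate):

```python
from math import sqrt, erfc

def two_proportion_z(p1, n1, p2, n2):
    """Two-sided z-test for the null hypothesis that two proportions are equal."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))  # two-sided p-value

z, p = two_proportion_z(0.75, 50, 0.65, 250)   # Harris vs. Curry
print(f"z = {z:.2f}, p = {p:.3f}")             # not significant at the usual 0.05 level
```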

(Below is the data I used… enjoy! Oh yeah, and buy my book, thanks!)

Year | Round | Player | Score | MaxPossible | TotalShots | Percentage
19861st RoundCraig Hodges25302583%
19861st RoundDale Ellis17302557%
19861st RoundKyle Macy13302543%
19861st RoundLarry Bird16302553%
19861st RoundLeon Wood13302543%
19861st RoundNorm Nixon9302530%
19861st RoundSleepy Floyd13302543%
19861st RoundTrent Tucker19302563%
19862nd RoundCraig Hodges14302547%
19862nd RoundDale Ellis14302547%
19862nd RoundLarry Bird18302560%
19862nd RoundTrent Tucker13302543%
1986FinalsCraig Hodges12302540%
1986FinalsLarry Bird22302573%
19871st RoundByron Scott9302530%
19871st RoundCraig Hodges13302543%
19871st RoundDale Ellis13302543%
19871st RoundDanny Ainge14302547%
19871st RoundDetlef Schrempf19302563%
19871st RoundKiki Vanderweghe12302540%
19871st RoundLarry Bird13302543%
19871st RoundMichael Cooper15302550%
19872nd RoundDanny Ainge8302527%
19872nd RoundDetlef Schrempf16302553%
19872nd RoundLarry Bird18302560%
19872nd RoundMichael Cooper10302533%
1987FinalsDetlef Schrempf14302547%
1987FinalsLarry Bird16302553%
19881st RoundBrent Price14302547%
19881st RoundByron Scott19302563%
19881st RoundCraig Hodges10302533%
19881st RoundDale Ellis16302553%
19881st RoundDanny Ainge14302547%
19881st RoundDetlef Schrempf15302550%
19881st RoundLarry Bird17302557%
19881st RoundTrent Tucker11302537%
19882nd RoundByron Scott11302537%
19882nd RoundDale Ellis12302540%
19882nd RoundDetlef Schrempf5302517%
19882nd RoundLarry Bird23302577%
1988FinalsDale Ellis15302550%
1988FinalsLarry Bird17302557%
19891st RoundCraig Hodges20302567%
19891st RoundDale Ellis19302563%
19891st RoundDanny Ainge11302537%
19891st RoundDerek Harper12302540%
19891st RoundGerald Henderson17302557%
19891st RoundJon Sundvold12302540%
19891st RoundMichael Adams12302540%
19891st RoundReggie Miller15302550%
19891st RoundRimas Kurtinaitis9302530%
19892nd RoundCraig Hodges18302560%
19892nd RoundDale Ellis18302560%
19892nd RoundGerald Henderson13302543%
19892nd RoundReggie Miller11302537%
1989FinalsCraig Hodges15302550%
1989FinalsDale Ellis19302563%
19901st RoundBobby Hansen15302550%
19901st RoundCraig Ehlo14302547%
19901st RoundCraig Hodges20302567%
19901st RoundJon Sundvold15302550%
19901st RoundLarry Bird13302543%
19901st RoundMark Price11302537%
19901st RoundMichael Jordan5302517%
19901st RoundReggie Miller16302553%
19902nd RoundBobby Hansen14302547%
19902nd RoundCraig Hodges17302557%
19902nd RoundJon Sundvold17302557%
19902nd RoundReggie Miller18302560%
1990FinalsCraig Hodges19302563%
1990FinalsReggie Miller18302560%
19911st RoundClyde Drexler8302527%
19911st RoundCraig Hodges20302567%
19911st RoundDanny Ainge18302560%
19911st RoundDennis Scott16302553%
19911st RoundGlen Rice9302530%
19911st RoundHersey Hawkins14302547%
19911st RoundTerry Porter15302550%
19911st RoundTim Hardaway15302550%
19912nd RoundCraig Hodges24302580%
19912nd RoundDanny Ainge11302537%
19912nd RoundDennis Scott12302540%
19912nd RoundTerry Porter14302547%
1991FinalsCraig Hodges17302557%
1991FinalsTerry Porter12302540%
19921st RoundCraig Ehlo10302533%
19921st RoundCraig Hodges16302553%
19921st RoundDell Curry11302537%
19921st RoundDrazen Petrovic13302543%
19921st RoundJeff Hornacek7302523%
19921st RoundJim Les15302550%
19921st RoundJohn Stockton11302537%
19921st RoundMitch Richmond12302540%
19922nd RoundCraig Hodges15302550%
19922nd RoundDrazen Petrovic8302527%
19922nd RoundJim Les20302567%
19922nd RoundMitch Richmond11302537%
1992FinalsCraig Hodges16302553%
1992FinalsJim Les15302550%
19931st RoundB.J. Armstrong11302537%
19931st RoundCraig Hodges14302547%
19931st RoundDan Majerle10302533%
19931st RoundDana Barros15302550%
19931st RoundKenny Smith12302540%
19931st RoundMark Price19302563%
19931st RoundReggie Miller14302547%
19931st RoundTerry Porter16302553%
19932nd RoundCraig Hodges16302553%
19932nd RoundDana Barros15302550%
19932nd RoundMark Price21302570%
19932nd RoundTerry Porter17302557%
1993FinalsMark Price18302560%
1993FinalsTerry Porter17302557%
19941st RoundB.J. Armstrong8302527%
19941st RoundDale Ellis20302567%
19941st RoundDana Barros13302543%
19941st RoundDell Curry12302540%
19941st RoundEric Murdock13302543%
19941st RoundMark Price15302550%
19941st RoundMitch Richmond12302540%
19941st RoundSteve Kerr18302560%
19942nd RoundDale Ellis10302533%
19942nd RoundDana Barros17302557%
19942nd RoundMark Price21302570%
19942nd RoundSteve Kerr13302543%
1994FinalsDana Barros13302543%
1994FinalsMark Price24302580%
19951st RoundChuck Person15302550%
19951st RoundDan Majerle9302530%
19951st RoundDana Barros9302530%
19951st RoundGlen Rice14302547%
19951st RoundNick Anderson12302540%
19951st RoundReggie Miller17302557%
19951st RoundScott Burrell19302563%
19951st RoundSteve Kerr13302543%
19952nd RoundChuck Person16302553%
19952nd RoundGlen Rice19302563%
19952nd RoundReggie Miller19302563%
19952nd RoundScott Burrell17302557%
1995FinalsGlen Rice17302557%
1995FinalsReggie Miller16302553%
19961st RoundClifford Robinson11302537%
19961st RoundDana Barros18302560%
19961st RoundDennis Scott19302563%
19961st RoundGeorge McCloud18302560%
19961st RoundGlen Rice17302557%
19961st RoundHubert Davis17302557%
19961st RoundSteve Kerr18302560%
19961st RoundTim Legler23302577%
19962nd RoundDennis Scott19302563%
19962nd RoundGeorge McCloud17302557%
19962nd RoundSteve Kerr17302557%
19962nd RoundTim Legler22302573%
1996FinalsDennis Scott18302560%
1996FinalsTim Legler20302567%
19971st RoundDale Ellis12302540%
19971st RoundGlen Rice16302553%
19971st RoundJohn Stockton13302543%
19971st RoundSam Perkins8302527%
19971st RoundSteve Kerr15302550%
19971st RoundTerry Mills11302537%
19971st RoundTim Legler17302557%
19971st RoundWalt Williams18302560%
19972nd RoundGlen Rice14302547%
19972nd RoundSteve Kerr21302570%
19972nd RoundTim Legler19302563%
19972nd RoundWalt Williams12302540%
1997FinalsSteve Kerr22302573%
1997FinalsTim Legler18302560%
19981st RoundCharlie Ward15302550%
19981st RoundDale Ellis18302560%
19981st RoundGlen Rice13302543%
19981st RoundHubert Davis15302550%
19981st RoundJeff Hornacek17302557%
19981st RoundReggie Miller12302540%
19981st RoundSam Mack14302547%
19981st RoundTracy Murray12302540%
19982nd RoundCharlie Ward11302537%
19982nd RoundDale Ellis15302550%
19982nd RoundHubert Davis24302580%
19982nd RoundJeff Hornacek15302550%
1998FinalsHubert Davis10302533%
1998FinalsJeff Hornacek16302553%
20001st RoundAllen Iverson10302533%
20001st RoundBob Sura9302530%
20001st RoundDirk Nowitzki18302560%
20001st RoundHubert Davis14302547%
20001st RoundJeff Hornacek17302557%
20001st RoundMike Bibby15302550%
20001st RoundRay Allen16302553%
20001st RoundTerry Porter15302550%
2000FinalsDirk Nowitzki11302537%
2000FinalsJeff Hornacek13302543%
2000FinalsRay Allen10302533%
20011st RoundAllan Houston11302537%
20011st RoundBryon Russell10302533%
20011st RoundDirk Nowitzki17302557%
20011st RoundPat Garrity15302550%
20011st RoundPeja Stojakovic19302563%
20011st RoundRashard Lewis12302540%
20011st RoundRay Allen20302567%
20011st RoundSteve Nash14302547%
2001FinalsDirk Nowitzki10302533%
2001FinalsPeja Stojakovic17302557%
2001FinalsRay Allen19302563%
20021st RoundMike Miller10302533%
20021st RoundPaul Pierce8302527%
20021st RoundPeja Stojakovic20302567%
20021st RoundQuentin Richardson14302547%
20021st RoundRay Allen14302547%
20021st RoundSteve Nash15302550%
20021st RoundSteve Smith8302527%
20021st RoundWesley Person21302570%
2002FinalsPeja Stojakovic19302563%
2002FinalsSteve Nash18302560%
2002FinalsWesley Person19302563%
20031st RoundAntoine Walker7302523%
20031st RoundBrent Barry19302563%
20031st RoundDavid Wesley12302540%
20031st RoundPat Garrity13302543%
20031st RoundPeja Stojakovic19302563%
20031st RoundWesley Person14302547%
2003FinalsBrent Barry17302557%
2003FinalsPeja Stojakovic20302567%
2003FinalsWesley Person20302567%
2003Finals-OvertimePeja Stojakovic22302573%
2003Finals-OvertimeWesley Person16302553%
20041st RoundChauncey Billups12302540%
20041st RoundCuttino Mobley13302543%
20041st RoundKyle Korver19302563%
20041st RoundPeja Stojakovic21302570%
20041st RoundRashard Lewis16302553%
20041st RoundVoshon Lenard18302560%
2004FinalsKyle Korver15302550%
2004FinalsPeja Stojakovic16302553%
2004FinalsVoshon Lenard18302560%
20051st RoundJoe Johnson8302527%
20051st RoundKyle Korver14302547%
20051st RoundQuentin Richardson14302547%
20051st RoundRay Allen13302543%
20051st RoundVladimir Radmanovic6302520%
20051st RoundVoshon Lenard17302557%
2005FinalsKyle Korver18302560%
2005FinalsQuentin Richardson19302563%
2005FinalsVoshon Lenard17302557%
20061st RoundChauncey Billups12302540%
20061st RoundDirk Nowitzki14302547%
20061st RoundGilbert Arenas14302547%
20061st RoundJason Terry13302543%
20061st RoundQuentin Richardson12302540%
20061st RoundRay Allen19302563%
2006FinalsDirk Nowitzki18302560%
2006FinalsGilbert Arenas16302553%
2006FinalsRay Allen14302547%
20071st RoundDamon Jones15302550%
20071st RoundDirk Nowitzki20302567%
20071st RoundGilbert Arenas23302577%
20071st RoundJason Kapono19302563%
20071st RoundJason Terry10302533%
20071st RoundMike Miller18302560%
2007FinalsDirk Nowitzki9302530%
2007FinalsGilbert Arenas17302557%
2007FinalsJason Kapono24302580%
20081st RoundDaniel Gibson17302557%
20081st RoundDirk Nowitzki17302557%
20081st RoundJason Kapono20302567%
20081st RoundPeja Stojakovic15302550%
20081st RoundRichard Hamilton14302547%
20081st RoundSteve Nash9302530%
2008FinalsDaniel Gibson17302557%
2008FinalsDirk Nowitzki14302547%
2008FinalsJason Kapono24302580%
20091st RoundDaequan Cook18302560%
20091st RoundDanny Granger13302543%
20091st RoundJason Kapono16302553%
20091st RoundMike Bibby14302547%
20091st RoundRashard Lewis17302557%
20091st RoundRoger Mason13302543%
2009FinalsDaequan Cook15302550%
2009FinalsJason Kapono14302547%
2009FinalsRashard Lewis15302550%
2009Finals-OvertimeDaequan Cook19302563%
2009Finals-OvertimeRashard Lewis7302523%
20101st RoundChanning Frye15302550%
20101st RoundChauncey Billups17302557%
20101st RoundDaequan Cook15302550%
20101st RoundDanilo Gallinari15302550%
20101st RoundPaul Pierce17302557%
20101st RoundStephen Curry18302560%
2010FinalsChauncey Billups14302547%
2010FinalsPaul Pierce20302567%
2010FinalsStephen Curry17302557%
20111st RoundDaniel Gibson7302523%
20111st RoundDorell Wright11302537%
20111st RoundJames Jones16302553%
20111st RoundKevin Durant6302520%
20111st RoundPaul Pierce12302540%
20111st RoundRay Allen20302567%
2011FinalsJames Jones20302567%
2011FinalsPaul Pierce18302560%
2011FinalsRay Allen15302550%
20121st RoundAnthony Morrow14302547%
20121st RoundJames Jones22302573%
20121st RoundKevin Durant20302567%
20121st RoundKevin Love18302560%
20121st RoundMario Chalmers18302560%
20121st RoundRyan Anderson17302557%
2012FinalsJames Jones12302540%
2012FinalsKevin Durant16302553%
2012FinalsKevin Love16302553%
2012Finals-OvertimeKevin Durant16302553%
2012Finals-OvertimeKevin Love17302557%
20131st RoundKyrie Irving18302560%
20131st RoundMatt Bonner19302563%
20131st RoundPaul George10302533%
20131st RoundRyan Anderson18302560%
20131st RoundStephen Curry17302557%
20131st RoundSteve Novak17302557%
2013FinalsKyrie Irving23302577%
2013FinalsMatt Bonner20302567%
20141st RoundArron Afflalo15302550%
20141st RoundBradley Beal21302570%
20141st RoundDamian Lillard18302560%
20141st RoundJoe Johnson11302537%
20141st RoundKevin Love16302553%
20141st RoundKyrie Irving17302557%
20141st RoundMarco Belinelli19302563%
20141st RoundStephen Curry16302553%
2014FinalsBradley Beal19302563%
2014FinalsMarco Belinelli19302563%
2014Finals-OvertimeBradley Beal18302560%
2014Finals-OvertimeMarco Belinelli24302580%
20151st RoundJames Harden15342544%
20151st RoundJJ Redick18342553%
20151st RoundKlay Thompson24342571%
20151st RoundKyle Korver18342553%
20151st RoundKyrie Irving23342568%
20151st RoundMarco Belinelli18342553%
20151st RoundStephen Curry23342568%
20151st RoundWes Matthews22342565%
2015FinalsKlay Thompson14342541%
2015FinalsKyrie Irving17342550%
2015FinalsStephen Curry27342579%
20161st RoundCJ McCollum14342541%
20161st RoundDevin Booker20342559%
20161st RoundJames Harden20342559%
20161st RoundJJ Redick20342559%
20161st RoundKhris Middleton13342538%
20161st RoundKlay Thompson22342565%
20161st RoundKyle Lowry15342544%
20161st RoundStephen Curry21342562%
2016FinalsDevin Booker16342547%
2016FinalsKlay Thompson27342579%
2016FinalsStephen Curry23342568%
20171st RoundCJ McCollum10342529%
20171st RoundEric Gordon25342574%
20171st RoundKemba Walker19342556%
20171st RoundKlay Thompson18342553%
20171st RoundKyle Lowry9342526%
20171st RoundKyrie Irving20342559%
20171st RoundNick Young18342553%
20171st RoundWes Matthews11342532%
2017FinalsEric Gordon20342559%
2017FinalsKemba Walker17342550%
2017FinalsKyrie Irving20342559%
2017Finals-OvertimeEric Gordon21342562%
2017Finals-OvertimeKyrie Irving18342553%
20181st RoundBradley Beal15342544%
20181st RoundDevin Booker19342556%
20181st RoundEric Gordon12342535%
20181st RoundKlay Thompson19342556%
20181st RoundKyle Lowry11342532%
20181st RoundPaul George9342526%
20181st RoundTobias Harris18342553%
20181st RoundWayne Ellington17342550%
2018FinalsDevin Booker28342582%
2018FinalsKlay Thompson25342574%
2018FinalsTobias Harris17342550%
20191st RoundBuddy Hield26342576%
20191st RoundDanny Green23342568%
20191st RoundDevin Booker23342568%
20191st RoundDirk Nowitzki17342550%
20191st RoundJoe Harris25342574%
20191st RoundKemba Walker15342544%
20191st RoundKhris Middleton11342532%
20191st RoundStephen Curry27342579%
2019FinalsBuddy Hield19342556%
2019FinalsJoe Harris26342576%
2019FinalsStephen Curry24342571%

The Scientific Mindset

Did you miss my rousing keynote at the Claremont Data Science conference? You’re in luck! Below are my slides, and every somewhat witty and thought-provoking thing I said…

Hi, I’m Jay. I’m super excited to be here, especially because this is the first time I’ve ever done a talk like this! My friend told me that this experience would be good for my next book, the 9 pitfalls of public speaking. Funny guy! My co-author Gary is sitting over there. You do know he’s the one with a PhD from Yale, right? You had a 50/50 shot at hearing from a genius today! Don’t worry, though, I’m not a total clown. I graduated from Pomona College with a degree in math and worked as a software developer for 11 years before following my inner data-wonk to the Analytics department of a booming Internet company. I would take the Metrolink from Claremont to downtown L.A. every day and those years of grand successes and epic failures taught me the value of scientific rigor.

I had a manager once who liked to say “up is up”, which I took as meaning that data speaks for itself. I strongly disagree. Data needs an interpreter. One who knows things. And machines don’t know anything. They’ll crunch random data and find statistical significance everywhere.

On top of that, as you can see here, it’s not even always clear what “up” is! We had all the data you could ever want at my work. Billions of rows piled into a Netezza database. So we tried to use that wealth of data to answer a simple question: which is more profitable, 1-click pages or 2-click pages? The answer we got back is that overall, 2-click pages are better. Then we asked, okay, what about in the U.S.? One-clicks are better there. How about outside of the U.S.? One-clicks are better there too. Which “up” are we supposed to believe? This is Simpson’s Paradox in all of its glory. In this case, the weighted average looks like that because we had U.S. traffic mostly on 2-clicks and international traffic mostly on 1-clicks. It’s even worse than that! The reason most U.S. traffic was on 2-clicks was that we had run randomized A/B tests showing that 2-clicks are better here, so even that top line is backwards! We decided to stick with the experiments so we wouldn’t get fooled by confounding variables.
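Here’s a tiny sketch of how that reversal can happen. The traffic and revenue numbers are invented, but they reproduce the pattern we saw: 2-clicks win overall while 1-clicks win in both segments, because most U.S. traffic sat on 2-click pages:

```python
# (visits, revenue) per segment and layout; the numbers are invented for illustration
data = {
    ("US",   "1-click"): (100, 12.0),
    ("US",   "2-click"): (900, 90.0),
    ("Intl", "1-click"): (900, 36.0),
    ("Intl", "2-click"): (100,  3.0),
}

def revenue_per_visit(cells):
    visits = sum(v for v, _ in cells)
    revenue = sum(r for _, r in cells)
    return revenue / visits

# within each segment, 1-click pages earn more per visit...
for region in ("US", "Intl"):
    for layout in ("1-click", "2-click"):
        print(f"{region:4} {layout}: ${revenue_per_visit([data[region, layout]]):.3f} per visit")

# ...but lumped together, 2-click pages look better
for layout in ("1-click", "2-click"):
    cells = [data[region, layout] for region in ("US", "Intl")]
    print(f"Overall {layout}: ${revenue_per_visit(cells):.3f} per visit")
```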

Data science is often said to be about extracting knowledge from data. Well, as you can see from this example, if you’re talking about historical data as opposed to data produced by an A/B test, you need to be very careful to ensure that what you’ve extracted is knowledge and not nonsense. Data science is less about extracting knowledge than creating useful data that can provide knowledge. Up is definitely not always up.

I hope to convince you today that for data science to work, you need to work like a scientist.

When people talk about what data scientists do, they always mention statistics and computer programming and also might say how important it is to have domain or subject knowledge. What they tend to forget is the “science” part. I’m here to tell you that the scientific mindset is essential: the critical thinking, the skepticism, the willingness to put your predictions to the test and make sure you’re not fooling yourself.

Rather than just go through all the pitfalls of data science, I’d like to talk about four ways that a scientific mindset can avoid them. (1) You can effectively interpret data (does it mean what you think it means?). (2) You can identify which features might be useful for making predictions. Machines can’t really do that for you, because if you include too many nonsense variables, they crowd out the real ones. (3) You’ll be able to evaluate evidence, develop a Spidey Sense, and avoid being fooled by the “silent evidence of failures”. Are you seeing the whole picture, or is someone showing you the statistical hits and hiding the misses? And (4) you run experiments whenever possible, because they provide the strongest evidence out there.

Okay, so let’s put your critical thinking to the test. What’s this data saying? This is earthquake data from the United States Geological Survey showing an alarming increase in the number of major earthquakes worldwide over the last century. Is the apocalypse approaching? Is the earth breaking apart? Or is something wrong with this data?

Don’t worry. These are the earthquakes that were recorded each year, not the number that occurred. There is now a far more extensive network of seismometers than in the past, so many earthquakes that went unnoticed decades ago now get monitored and logged.

If the data tells you something crazy, there’s a good chance you would be crazy to believe it.

Too easy? Give this one a shot. At Berkeley, I was part of a group that analyzed data for over 7,000 patients with sepsis at a Chicago hospital to find a way to predict the chances of being readmitted to the hospital after being discharged. You can see here that we found a strong relationship between the pH level of the patient’s blood (normally between 7.35 and 7.45) and the hospital readmission rate.

There is a clear positive relationship, indicating that patients with high pH levels are more likely to return to the hospital soon after being discharged. A low pH signals that a discharged patient is unlikely to be readmitted. The correlation is 0.96 and data clowns would call it a day. “Up is up”!

However, my teammates and I were not clowns, so we made sure to run this by a doctor to see if it made sense. When he saw this figure, a puzzled look came across his face: “That’s strange; the relationship is backwards. If you have a low pH level, you’re probably dead,” but the chart implied that having a very low pH level was a sign of health. This stumped us until we realized that the data included patients who died during their hospital stay! We had simply found that the patients least likely to be readmitted are the ones who were discharged to the mortuary.

This figure shows that, once we removed the deceased patients, the pattern reversed. Now there is a negative relationship, just as the doctor expected.

This one shows the clear danger of acidic blood by comparing pH level with the likelihood of death. Patients with pH values below 7.2 are not in good health, they are in serious danger. In this case, the data spoke, but it was talking about something else.

In this case, only the scientific mindset saved us from embarrassment.

It gets even trickier. How can you dismiss patterns that repeat consistently? After Oklahoma won 47 straight college football games, Sports Illustrated ran a 1957 cover story proclaiming, “Why Oklahoma is Unbeatable.” Oklahoma lost its next game and people started noticing that other athletes or teams who appear on the cover of Sports Illustrated tend to perform worse afterward. The Sports Illustrated Jinx was born. More recently, we have the Madden Curse, which says that the football player whose picture appears on the cover of Madden NFL, a football video game, will not perform as well the next season. The Sports Illustrated Jinx and the Madden Curse are extreme examples of regression toward the mean. When a player or team does something exceptional enough to earn a place on the cover of Sports Illustrated or Madden NFL, there is essentially nowhere to go but down. To the extent luck plays a role in athletic success, the player or team that stands above all the rest almost certainly benefited from good luck: good health, fortunate bounces, and questionable officiating. Good luck cannot be counted on to continue indefinitely, and neither can exceptional success. There’s a Swedish proverb that states, “Luck doesn’t give, it only lends.”
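If you want to see the “jinx” appear out of pure luck, here’s a minimal simulation sketch (all the numbers are invented): every player’s true skill stays fixed, we put the season-one leaders “on the cover,” and they decline on average in season two while remaining above the league average.

```python
import numpy as np

rng = np.random.default_rng(7)
n_players = 500
skill = rng.normal(0.50, 0.02, n_players)   # each player's true (unchanging) skill

def season(true_skill, attempts=500):
    # observed performance = skill plus luck (binomial noise over a season of attempts)
    return rng.binomial(attempts, true_skill) / attempts

year1, year2 = season(skill), season(skill)
cover = np.argsort(year1)[-10:]             # the 10 best year-one performers get "the cover"

print(f"cover athletes, year 1: {year1[cover].mean():.3f}")
print(f"cover athletes, year 2: {year2[cover].mean():.3f}")  # worse, but still above average
print(f"league-wide average:    {year1.mean():.3f}")
```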

We ran into this at work. My company specialized in maximizing profit for domainers, or people who collect websites in order to show ads. We designed and conducted experiments to find the most profitable page layouts. So, for example, a web visitor comes in and we generate a random number to determine which page they go to. When we then compared how the various pages performed, we knew we could trust the results because no possible confounding variable could be correlated with a random number. If we had just used a different layout each day, the results might be muddled by the nature of web traffic on the different days; for instance, people are typically more likely to click on ads on a Monday than over the weekend. So anyway, we knew what the most profitable design was and used it all over the place.

Anyway, our customers have collections of domain names and, of course, some names do better than others. Some would ask us to work on their “underperformers” and see if we could get revenue up. So my friend in Analytics would change the web page design or the keywords and every single time, revenue would go up by 20% the next day. He was a hero and they were like “you should do this full time!” We made the point that, to be scientifically rigorous, we should really only be working on a random half of the names in order to have a control for comparison, but they thought we were being ridiculous. A twenty percent revenue lift the day after he made changes and we’re nitpicking the process?

Well, one day he forgot to make the changes and the next day, revenue went up for the names by 20% like they always did. It was like an anti-jinx! Instead of the best performers getting on the cover of Sports Illustrated, this was the worst performers getting emailed to Analytics. Someone came by his desk to congratulate him again and he said “I didn’t get around to it yet” and they said “well, whatever you did worked!” Now, we knew for sure that we had to hold back a control, because there was no way to know if what he was doing was helping or hurting!

It turns out that Regression toward the Mean is everywhere. Let’s quickly go through a few more examples…

Why is there a sophomore slump? It’s because you’re looking at the best freshmen. Whether you’re looking at the best batting averages or any other statistic, the top performers will almost always do worse the next time.

Why are movie sequels typically worse than the originals? Well, if you’ve been paying attention, you know how to fix this one. If you want sequels that are better than the originals, make sequels to the worst movies!

Why does punishment seem to work better than reward? Johnny does something exceptionally well and you give him a treat. Then, he does worse. Johnny does something exceptionally badly and you whack him. Then, he does better. The same thing would have happened without the treats or the whacking.

There was a study showing that kids who underperformed on the SAT did better the next time if they were on a drug called propranolol to relax them. As soon as you heard “underperformed”, I hope your Spidey Sense tingled. They’re expected to do better if they underperformed! The kids almost certainly did worse on the drug than they would have without it, but you’d never know, because they didn’t use a randomized control.

Now, to be clear, this is not the even-steven theory from Seinfeld. Your luck will not reverse, you will just become less lucky.

So you can see why people believe in jinxes! By the way, if you think this talk is exceptionally interesting right now, I have bad news about the rest of the presentation (I knock on wood).

So interpreting data requires the scientific mindset, but what about finding good predictive features? Here’s a tough one for you. What is this thing?

Did anyone get it? Of course! This is easy for humans. Even if the sign were bent, rusty, or has a peace sticker on it, we would still know what it is. Not so with image-recognition software. During their training sessions, Deep Neural Net algorithms learn that the words “stop sign” go with images of many, many stop signs. Because they look at individual pixels, computer programs can be led astray by trivial variations. People can exploit this and intentionally cause a misidentification with tiny changes, called an adversarial attack.

Gary did a quick-and-dirty test by putting a peace sign on an image of a stop sign to see what a DNN would conclude. It misidentified the image as a first-aid kit. In 2018, the organizers of a machine-learning conference announced that they had accepted 11 papers proposing ways to thwart adversarial attacks like this. Three days later, an MIT graduate student, a Berkeley graduate student, and a Berkeley professor reported that they had found ways to work around 7 of these defense systems. There is clearly an AI arms race going on.

So how can thinking like a scientist possibly help a neural net work better? I talked to one to find out! “The Retinator”, Dr. Michael Abramoff, invented an autonomous AI system to diagnose diabetic retinopathy (DR), which is the leading cause of blindness in working-age adults. He said it took him decades, but he eventually succeeded in building a neural net AI system that performed as well as a doctor.

So, if neural nets can be confused about stop signs, how did the Retinator keep his AI from being fooled?

His approach was to classify images of the eye the same way retinal experts do, by looking for specific signs of DR. He developed multiple detectors to look for known predictive features such as hemorrhages and other biomarkers. He also wanted his results to be comprehensible so that doctors and patients understand the diagnosis. If his system failed to recognize a case of DR, he wanted to know why it failed. He said, “If I give clinicians an image with a bunch of hemorrhages, they’ll say ‘This is likely DR.’ If I start taking those hemorrhages away, eventually they’ll say ‘there’s no disease here.’” His biomarker AI system works the same way.

He wouldn’t settle for a black box, blank slate approach, because he knew that would risk catastrophic failure. In theory, letting computers teach themselves which characteristics are useful might find important features that clinicians didn’t know about. However, much of the data is irrelevant, so many features found to be correlated statistically with a DR diagnosis will be spurious. As with the stop sign detector, errors can arise when algorithms are put on auto-pilot. In the case of DR, there might be a bias due to the color of the retina, a different kind of background, or even part of the border around the image. A black-box model can fail with new images, with no one knowing why it failed.

Here you can see an example of a catastrophic failure of a black box algorithm. Its diagnosis is so fragile that you don’t even need a peace sign; changes in pixels that humans can’t even perceive can completely change the prediction. The Retinator’s system wasn’t tricked because it only considers the small number of features that make sense. In healthcare, the possibility of these adversarial images is particularly concerning because of the ability to make fraudulent claims by exploiting automated diagnoses.

In April 2018, Dr. Abramoff’s IDx-DR system became the first FDA-approved autonomous AI diagnostic system.

Feature selection isn’t just a problem with neural nets. In 2011, Google created a program called Google Flu that used search queries to predict flu outbreaks. They reported that their model had a correlation of 0.975 with the actual number of flu cases from the CDC. Their data-mining program looked at 50 million search queries and identified the 45 queries that were most closely correlated with the incidence of flu. It was pure-and-simple data-mining. A valid study would use medical experts to specify a list of relevant query phrases in advance, and then see if there was an uptick in these queries shortly before or during flu outbreaks. Instead, Google’s data scientists had no control over the selection of the optimal search terms. The program was on its own, with no way of telling whether the search queries it found were sensible or nonsense. Google Flu may have been simply a winter detector. When it went beyond fitting historical data and began making real predictions, Google Flu was far less accurate than a simple model that predicted that the number of flu cases tomorrow will be the same as the number today. After issuing its report, Google Flu overestimated the number of flu cases by an average of nearly 100 percent. Google Flu no longer makes flu predictions.

Now I want to be clear: this type of automated data-mining is not what helped Google take over the world. It’s the thousands of rigorous A/B tests that they run that allowed them to do that. Having a huge amount of data to analyze for patterns is not enough, and Google knows that.

Compare Google Flu with how Wal-Mart stocks its shelves when a hurricane is on the way. Customers don’t just buy water and flashlights; they also buy strawberry Pop-Tarts and beer. Since historical data was analyzed, this appears at first glance to be more mindless data mining. However, it is actually more like a controlled experiment! Recall that one major downside of data mining is the possibility of confounding variables. However, since hurricanes only affect a few stores out of many, Wal-Mart had a natural experiment that eliminates confounding influences like the day of the week or season of the year. This is almost as good as letting mad scientists randomly choose cities to be blasted by hurricanes and then comparing the shopping habits of the lucky and unlucky residents. The scientific method is alive and well at Wal-Mart. Another problem with data mining is that correlation can get confused with causation. It is highly unlikely that customers stocked up on Pop-Tarts in threatened cities for some reason other than the hurricanes. Also, unless buying Pop-Tarts causes hurricanes, the relationship clearly goes in the other direction. We might not know exactly why people buy these products, but we do know that hurricanes caused the increase in demand. An additional reason to believe in the Pop-Tart / hurricane connection is that the association makes sense. Pop-Tarts don’t have to be cooked and last practically forever. Taking advantage of natural experiments like this is something a scientist would think of.

So data doesn’t speak for itself and features don’t select themselves. Let’s shift a bit and talk about the “silent evidence of failures”. There are actually two versions of what’s called the Texas Sharpshooter Fallacy. The first one is that in order to prove what a great shot I am, I paint a thousand targets on a barn wall, fire my gun at it, and what a surprise, I hit a target! And then I go and erase all the other targets. And, of course, it’s meaningless, because, with so many targets, I’m bound to hit something… So the Texas Sharpshooter Fallacy #1 is testing lots and lots of different theories and reporting the one that seems confirmed by the data and not telling anybody that you tested many other theories. This fallacy contributes to the replication crisis in science, because there’s a publication bias towards significant findings. You’ll see the hits but not the misses.

Texas Sharpshooter Fallacy #2 is the picture here. You just fire your gun blindly at the wall and then you go and draw a target around the bullet hole and pretend that’s what you were aiming for. That’s like looking at the data, finding some coincidental little thing in there, and pretending that’s what you were looking for in the first place. I think there should be a third one, where there’s only one target, but you just keep shooting until you hit it. Then, you hide all of the bullet holes outside of the target and show what a good shot you are. This is what they call p-hacking, a reference to testing again and again until you finally dip below that magical “statistically significant” p-value threshold of 0.05.
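As a rough illustration of how well this third version works, here is a small simulation of my own (the sample sizes are made up): both groups are drawn from the same distribution, but if you keep adding subjects and re-testing, stopping the moment p drops below 0.05, you “find” an effect far more often than the advertised 5 percent of the time.

```python
# Optional stopping on a null effect: both groups come from the same
# distribution, yet peeking after every batch of new subjects and stopping
# at p < 0.05 yields far more "significant" results than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_experiments = 1_000
false_positives = 0

for _ in range(n_experiments):
    a = list(rng.normal(size=20))
    b = list(rng.normal(size=20))
    for _ in range(18):                  # peek after every 10 extra subjects per group
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1
            break
        a += list(rng.normal(size=10))
        b += list(rng.normal(size=10))

print(f"declared 'significant': {false_positives / n_experiments:.1%} (nominal rate: 5%)")
```

The effect being hunted does not exist, yet a meaningful share of these imaginary studies ends with a publishable p-value.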

Here’s a simple example: Derren Brown is a mentalist who said he could flip ten heads in a row with a fair coin. This is an astonishing claim since there is only a 1 in 1024 chance of doing that. Brown backed up his claim with a video filmed from two angles. There were no cuts in the video, it wasn’t a trick coin, and there were no magnets or other trickery involved. Is your Spidey Sense tingling?

In a later video, he gave away his secret: he had simply filmed himself flipping coins for nine hours until he got ten heads in a row. The video seemed magical, but it was a tedious trick. Brown’s prank is a clear example of how our perception can be distorted by what Nassim Taleb called the “silent evidence” of failures. If we don’t know about the failures, how can we evaluate the successes? As you develop your scientific Spidey Sense, you’ll notice that a lot of evidence starts to look like videotaped hot streaks.
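Out of curiosity, here is a quick sketch of how much silent evidence a stunt like that requires, simulating a fair coin:

```python
# How many flips does it take, on average, to see ten heads in a row?
# Theory says 2^11 - 2 = 2,046 flips; the simulation agrees.
import random

def flips_until_streak(streak_len=10):
    run = flips = 0
    while run < streak_len:
        flips += 1
        run = run + 1 if random.random() < 0.5 else 0
    return flips

trials = [flips_until_streak() for _ in range(2_000)]
print(f"average flips needed: {sum(trials) / len(trials):,.0f}")
```

At a relaxed pace of one flip every fifteen seconds or so, two thousand flips works out to roughly nine hours, which is exactly the tedious trick Brown confessed to.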

So that was about interpreting evidence. Now I’d like to talk about creating compelling evidence. Suppose I want to convince you that poker is a game of skill and that you’re skeptical of my claim. Let’s say you think poker is a game of chance because the cards are randomly dealt and poker professionals often go broke. What evidence would convince you?

Well, regarding pros going broke, let me tell you about an interesting gambling experiment. Participants were given $25 and challenged to make as much money as they could in 30 minutes. They would do this by betting on a virtual coin that lands on heads 60 percent of the time. Clearly, betting on heads is a winning strategy, but how much should you bet? It turns out that something called the Kelly criterion gives an elegant answer: bet the “edge.” The edge is the difference between the winning and losing chances, so the edge for this game is 60 minus 40, or 20 percent. Betting 20 percent of your bankroll on each flip grows your money faster, in the long run, than any other fixed fraction. I still remember the day Art Benjamin taught me that (he was in the audience). Most people in the study bet much more than this, and 28 percent lost all their money, even though they were winning 60 percent of the time. Despite the fact that the results of this experiment depend on coin tosses, I would consider this a game of skill because people who know what they are doing can expect to make more money than those who don’t. I would argue that broke poker pros are like the 28 percent who lost their money here. This experiment shows that betting on the right outcome is a different skill than knowing what stakes you can afford.
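Here is a rough simulation of that game (my own sketch, not the study’s exact protocol) comparing a Kelly bettor with two over-bettors:

```python
# The 60/40 coin game: betting the Kelly fraction (20% of the bankroll) grows
# steadily, while over-betting often ruins players who still win 60 percent
# of their flips. A bankroll under a cent counts as broke.
import numpy as np

rng = np.random.default_rng(3)

def play(fraction, flips=100, bankroll=25.0):
    for _ in range(flips):
        bet = fraction * bankroll
        bankroll += bet if rng.random() < 0.6 else -bet
        if bankroll < 0.01:
            return 0.0
    return bankroll

for frac in (0.20, 0.50, 0.80):
    results = [play(frac) for _ in range(2_000)]
    broke = sum(r == 0.0 for r in results) / len(results)
    print(f"bet {frac:.0%}: median final bankroll ${np.median(results):,.2f}, went broke {broke:.0%}")
```

The over-bettors win the same 60 percent of their flips; they just stake more than they can afford, which is the sense in which the 28 percent who went broke (and the busted poker pros) failed a test of skill rather than a test of luck.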

To create the strongest evidence that poker is a game of skill, I ran my own experiment. I predicted I could beat online poker and I put my own money on the line to prove it. I invested a whopping $50 into an online poker site and gathered data on how players respond to an immediate but small all-in bet. I downloaded the data from these experimental games and analyzed it. It turns out that people call too often. I used the data to determine which hands would win the maximum amount of money per hour assuming my opponents didn’t adjust. I called my strategy “the System”.

As you can see, when I started using the System, my opponents were no match for data science, and they probably wish luck played a bigger role in poker than it does.

This is not to say that the element of luck doesn’t shake things up once in a while. It may not look like much on the chart, but at one point my bankroll dropped $1,800. Fortunately, I knew about the Kelly Criterion, so I never played at stakes that were so high that I could go broke. Of course, it was possible that my opponents had finally adjusted to my strategy, so I analyzed the data again. I found out that it was just bad luck that had taken a bite out of my bankroll. I continued playing the System and my luck turned around as expected. Because these results would be exceedingly unlikely to happen by chance, you almost have to believe my claim that poker is a game of skill. Data science isn’t just about analyzing data correctly, it’s about presenting your findings in a compelling way. And nothing is more compelling than experimental results.

You’ve waited long enough, so here are the 9 Pitfalls of Data Science. I want you to look down the list and notice something they have in common: these are problems that can’t be solved by automation; they’re job security for data scientists! The Google Flu and Wal-Mart Pop-Tarts stories describe different ways of analyzing historical data and show that the less mindless the approach, the better. Analysis on auto-pilot doesn’t work because data doesn’t speak for itself and up is not always up.

Similarly, the Retinator’s autonomous AI system got approved by the FDA because it was more than a black box stop sign identifier. People still have an important role in focusing computers on the features that matter.

The final take-away is that the way around the pitfalls is to follow the path of scientists. Be critical thinkers because data and features don’t always make sense. Be skeptics, because anyone can torture data until it backs them up. Be experimenters, because the strongest evidence is the evidence that could have gone against you. Put the science into data science. Be scientists.

Thank you so much for listening!


(As an Amazon Associate I earn from qualifying purchases!)

Quiz Time! Count the Pitfalls

I recently received an email from a financial advice firm about “rational decision making” that had a promising intro: “We’ve discovered five common investor biases and emotions that can lead to below-average returns.” Biases, I don’t like those. Emotions hurting returns, those aren’t good either. I’m listening!

It describes loss aversion (irrationally holding a losing security in hopes that it will recover) and anchoring (relying too much on an initial piece of information) before offering this description of hindsight bias…

Hindsight bias – Many investors believe that an asset will perform well in the near future because it has performed well in the recent past. As a result, some investors are constantly chasing returns. If we reflect on this, it is actually counterintuitive. The number one rule of investing is buy low and sell high. Take a look at the S&P 500 chart above. If you have not owned the S&P 500 Index over the last nine years, is now the time to buy after it is 300 percent more expensive than it was nine years ago?

Okay, how many problems can you find in that last sentence?

I count three!

  • “If you have not owned the S&P 500 Index …” Why mention this? It is a sunk cost fallacy to consider whether you bought something in the past. It’s either a good investment or it’s not.
  • “…over the last nine years…” This is classic cherry-picking. Where did the number nine come from? You can bet it came from the last time the S&P hit a low point.
  • “…is now the time to buy after it is 300 percent more expensive than it was nine years ago?” This is the gambler’s fallacy. It’s rational to expect something that’s done extremely well to do less well (regression toward the mean), but it’s not rational to imply that it’s now a bad investment due to its recent history. There is no force of nature that requires all good returns to be balanced out by bad returns. There is irony in providing this comment after explaining the anchoring bias to the reader.

Beware of people advising to “buy low and sell high” as if they know what low and high are. If it were that easy, the firm should just send out an email that says “LOW” or “HIGH” in the subject line so its customers can act accordingly and beat the market.

If you spotted the data science pitfalls in that financial advice, congratulations, you’re well on your way to becoming a skeptical and savvy consumer of data!

It’s Normal to be Irrational

In response to my blog on Roundup, I received this email (and no, it wasn’t addressed “dear Wonk”)…

“This is an excellent article and presentation of how the desired outcome affects the stated conclusion of a study. Based on news headlines, I have vilified Roundup. This is a convincing reminder to dig deeper before taking a position. 

What I see as a similar debate is the one about the risks and efficacy of vaccines. It is a particularly hot topic here in Oregon as the state legislature is working on a bill that removes the option for a non-medical exemption from vaccination for school children. If the child is not vaccinated and does not have a medical exemption, that child will not be allowed to attend public school. 

I find this similar to the Roundup issue because I have been told that there are studies that support both conclusions: vaccines do cause autism and other auto-immune disease and vaccines do not cause these conditions. I have not done any research myself. I understand that the linchpin study supporting the harmfulness of vaccines has been retracted. What is the truth?

I have friends looking to move out of the state if this bill becomes law. 

I would like to understand the science before addressing the personal liberty issue of such a law. 

Thanks for waking up my critical thinking skills.”

The study referenced was indeed retracted, but only after 12 long years. Even after it became known that the lead author failed to inform anyone that the study “was commissioned and funded for planned litigation” and that he had falsified data, his report continued to cause a decline in vaccination rates. While there is always a slight risk of a severe allergic reaction, there is no evidence for a link between vaccines and these diseases, and the dangers vaccines prevent are far greater than any they cause. By creating belief in a false link between vaccines and autism, a belief that has directly cost lives, the retracted study may go down as one of the most damaging cases of fraud in history.

By the way, if you’re ever in need of evidence for the efficacy of vaccines, look no further than these visualizations.

From a scientific standpoint, this issue looks like a slam dunk, so why are there still so many people trying to get out of vaccinations? For one, many are well aware of the evils of Big Pharma and profit-driven healthcare: the $30 billion budget for medical marketing in the US has brought us the opioid crisis, unaffordable drugs, and medication overload (40% of the elderly are taking five or more medications). It’s hard to imagine that public health is valued nearly as much as profit in this country. However, given the phony autism study above, created in the hope of winning lawsuits, maybe people who are watching out for Big Pharma simply need to learn to also watch out for Big Lawya.

I’m sure that awareness wouldn’t hurt, but it isn’t enough. Debunking studies misrepresenting the dangers of vaccines and ushering in piles of evidence about their benefits will probably have little effect on someone who wants to opt out. So what is it that’s actually causing these people to leave their kids and others vulnerable to potentially deadly diseases?

I’m thinking it’s the misguided fear of regret. In Michael Lewis’s book The Undoing Project, he mentions a paper about decision analysis called “The Decision to Seed Hurricanes.” There was a new technique available to the government (dumping silver iodide into storms) which could reduce the amount of damage done by hurricanes or alter their paths. However, while the government would not be given credit for what had been avoided (since nobody would know for sure), it would certainly be blamed for the damage the storm inflicted on its new path. This asymmetry between credit and blame causes a bias towards non-intervention, which struck me as similar to a parent’s decision for or against a vaccination. Their child may or may not have been on a path towards an infectious disease and if the vaccine turned out to be a life-saving factor later on, nobody would know.

Economists traditionally model people as rational decision-makers who don’t seek risk or avoid risk; they weigh risk. People are expected to maximize their utility, a clever metric that standardizes the value of various outcomes so that math can be used to find the choices with the best expected value. However, psychologists Kahneman and Tversky found that rather than maximizing utility, people seek to minimize regret. In a memo to Tversky, Kahneman wrote, “The pain that is experienced when the loss is caused by an act that modified the status quo is significantly greater than the pain that is experienced when the decision led to the retention of the status quo. When one fails to take action that could have avoided a disaster, one does not accept responsibility for the occurrence of the disaster.”

If you point out that people are irrational, they take it personally in a way that they wouldn’t if you pointed out that they’ve been fooled by an optical illusion. What psychologists have discovered is that it’s normal to be irrational when faced with particular types of problems, so we shouldn’t take it personally. We should just learn when those times are and resist the pull towards bad decision-making. We shouldn’t be angry when governments require us to be vaccinated; we should be thankful. They are saving us from our tendency to make bad decisions.

You may have read about the recent public health concern for moviegoers who may have been exposed to someone with measles at a theater showing the Avengers a few weeks ago. Here’s my takeaway idea to help people overcome their cognitive blind spots on this: those who were at that theater and remained healthy should publicly thank their parents for vaccinating them. “I didn’t get measles the other day, thanks mom!” When we can, let’s make a point to acknowledge the moments when disaster didn’t strike and give credit where it’s due.

Reasonable Doubt: The Roundup Debate

A little bird sent me this request: “How about an analysis of the Roundup thing?”

I’d read about the multimillion-dollar lawsuits recently won against Monsanto, an agricultural biotechnology company best known for its public relations challenges and questionable ethics. However, I wasn’t aware that Bayer had purchased the company last year and can now look forward to over 11,000 lawsuits! It certainly appeared as if the verdict was in on the dangers of Roundup, but jurors aren’t known for their ability to evaluate scientific evidence. While Monsanto has a financial incentive to try to influence studies, lawyers also do well in times of public hysteria (“mesothelioma” was the top-paying Internet ad at my last job). So let’s take a crack at getting beyond all of the perverse incentives at play and focus on the main question: Does Roundup cause cancer?

With controversial topics like these, it’s important to first look for a scientific consensus. In this case, the EPA, European Food Safety Authority, Food and Agriculture Organization, European Chemicals Agency, Health Canada, German Federal Institute for Risk Assessment, and others have concluded that glyphosate, the active ingredient in Roundup, does not pose a cancer risk at the levels people are exposed to. However, the consensus on glyphosate is not unanimous; one organization, the International Agency for Research on Cancer (IARC), classified glyphosate as a “probable carcinogen.” Is this the only agency to escape Monsanto’s influence, or is there another explanation?

It turns out that the IARC evaluates risk in a different way than the other agencies. It determines whether the substance can cause cancer at exposure levels far more extreme than any that would be found in the real world. Practically everything is dangerous in high amounts (including water), and the IARC, accordingly, has found only one out of the hundreds of agents it has evaluated to be “probably not carcinogenic.” I’m not accusing the IARC of practicing pseudoscience, but let’s just say that I’m sleeping better now that I know they’re the ones behind the California Prop 65 cancer warnings at fast food restaurants. I figure that as long as I don’t ingest 400 chalupas per day, I’ll probably be okay.

Due to the consensus of worldwide regulatory agencies (and with IARC’s conclusion put into context), I would already feel comfortable concluding that there is not sufficient evidence showing that Roundup causes cancer. However, let’s go down a level to the studies themselves and see what we find. The reason I didn’t start here is that individual studies can be very unreliable, especially when it comes to epidemiological studies (as opposed to controlled experiments). That said, one of the strongest designs for these types of studies is the “prospective cohort study,” which follows a population of people with various exposure levels to the chemical over time and, only later, determines whether or not the groups show significant differences in health. While such studies can have their conclusions reversed by unconsidered confounding variables (“Oops, people living close to power lines tend to be poorer and have less access to healthcare”), they at least avoid the problem of selective recall that plagues case-control studies (“Hmm, I didn’t know what caused my tumor, but now that you mention it, I DO spend a lot of time on the cell phone!”). Following the surprising IARC conclusion, a study revisited and followed up on data from the large Agricultural Health Study (AHS). It found, in agreement with earlier conclusions, “no association was apparent between glyphosate and any solid tumors or lymphoid malignancies overall, including NHL and its subtypes.”

It certainly is looking like the evidence against Roundup is weak. However, a recent study in the journal Mutation Research threw a monkey wrench into things and associated weed killing products with non-Hodgkin lymphoma (NHL). It used the same AHS data above and combined it with a few less reliable case-control studies to conclude that people exposed to glyphosate have a 41% higher likelihood of developing NHL.

I’m a bit uncomfortable with the fact that it used the same data from a study that found no significant risk, added in less reliable data, and then concluded that there IS a risk. That seems like taking advantage of post-hoc wiggle-room. Another problem is that the 20-year time lag is the only one mentioned in the analysis. Why not report the results of the 15-year or 10-year lag since exposure? The 20-year lag was the only one that showed an elevated relative risk. Coincidence? Read my upcoming book and you’ll suspect that this is Pitfall #5: Torturing Data. The study reports a 95% confidence interval as if the authors had a hypothesis, tested it, and found an increase in risk that would be unlikely if Roundup weren’t dangerous. In reality, when they skipped over data points that didn’t support their conclusion before landing on the one that did, the likelihood that they would find something increased significantly. I can’t help but wonder if they would have even bothered to combine data from the less reliable studies if the AHS data had shown significance on its own. I get the impression they found the result they did because they went looking for it, and accordingly, their conclusion should be taken with a grain of salt. It would be analogous to asking “what are the chances I found an Easter egg in this particular place, given there were 20 possible places to search?” and not mentioning that you had searched a few other places before you found it. This may seem nit-picky when only a few results weren’t mentioned, but their whole conclusion of “statistical significance” hinges on it!
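To see why the reporting choice matters, consider the arithmetic of multiple looks. This is illustrative only, and it assumes the looks are independent, which is a simplification:

```python
# If several analysis choices (lag windows, subgroups, data combinations) are
# tried and only the most favorable is reported, the chance of a false alarm
# climbs well past the nominal 5%. Assumes independent looks (a simplification).
for k in (1, 3, 5, 10):
    print(f"{k:2d} looks -> P(at least one p < 0.05 under the null) = {1 - 0.95**k:.0%}")
```

Even three looks push the real false-alarm rate to roughly 14 percent, nearly triple what the reported 95 percent confidence interval implies.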

Observational studies like this are unreliable in the best of circumstances. They have the burden of showing that higher doses lead to a higher likelihood of illness (dose-response relationship). They have the burden of controlling for variables such as age, family history, body weight, and other things that may bias the results (confounding variables). For an extreme example, suppose there were a study that was much more compelling because it took blood samples of thousands of people and everyone with cancer had Roundup in their blood and everyone without cancer did not. A slam dunk case! Until later they find out that everyone with cancer also had DDT or some other chemical in their blood due to the fact that they were all farmers using a variety of insecticides. Suddenly, the case could fall apart.

Even if this study had carefully done everything possible and found that higher exposure to Roundup led to higher chances of developing NHL, and also had a strong reason ahead of time to expect the effect to show up only after a 20-year lag, it would still be one observational study up against a consensus of health agencies around the world. People want to believe that science can confidently answer questions like “what causes cancer?” by simply analyzing data. The truth is that, without randomized testing, science is severely hobbled. At best, it can draw tentative conclusions when data is carefully collected and analyzed by scientists who are trained in not fooling themselves and who haven’t peeked at the results before forming their hypotheses. Before you vote “guilty” on your next jury, remember that scientific consensus among multiple scientific organizations represents our best understanding of the world. In general, if you rely on “worldwide conspiracy and bribery” as the explanation for why scientific organizations are wrong, your argument is in trouble. No matter how compelling a conspiracy theory may be, the weight of scientific consensus should provide you with more than a reasonable doubt.

Disagree? Let me know what I got wrong and I’ll post updates. And keep those ideas coming for future blog entries!

https://www.motherjones.com/environment/2019/03/glyphosate-roundup-cancer-non-hodgkin-lymphoma-epa-panel-hardeman-lawsuit-jury-verdict/

https://www.skepticalraptor.com/skepticalraptorblog.php/glyphosate-linked-non-hodgkin-lymphoma-analysis/

https://www.science20.com/geoffrey_kabat/paper_claims_a_link_between_glyphosate_and_cancer_but_fails_to_show_evidence-236698

Artificial Unintelligence

The AI Delusion
by Gary Smith
Oxford University Press, 256 pp., USD 27.95.

“I for one welcome our new computer overlords” – Ken Jennings, Jeopardy champion

In The AI Delusion, economist Gary Smith provides a warning for mankind. However, it is not a warning about machines, it is about ourselves and our tendency to trust machines to make decisions for us. Artificial Intelligence is fantastic for limited and focused tasks but is not close to actual general intelligence. Professor Smith points out that machines, for which all patterns in data appear equally meaningful, have none of the real-world understanding required to filter out nonsense. Even worse is the fact that many of the new algorithms hide their details so we have no way of determining if the output is reasonable. Even human beings, when not engaging their critical thinking skills, mistakenly draw conclusions from meaningless patterns. If we blindly trust conclusions from machines, we are falling for the AI delusion and will certainly suffer because of it.

The Real Danger of Artificial Intelligence

Speculators about the future of artificial intelligence (AI) tend to fall into one of two camps. The first group believes that, when hardware reaches the same level of complexity and processing speed as a human brain, machines will quickly surpass human-level intelligence and lead us into a new age of scientific discovery and inventions. As part of his final answer of the man vs. machine match against IBM’s Watson, former Jeopardy! champion Ken Jennings seemed to indicate that he was in this first camp by welcoming our computer overlords. The impressive AI system, which beat him by answering natural language questions and appeared to understand and solve riddles, made fully intelligent machines seem to be right around the corner.[1]

The second camp dreads an AI revolution. Having grown up on sci-fi movies, like the Matrix and the Terminator, they worry that superior intelligence will lead machines to decide the fate of mankind, their only potential threat, in a microsecond. Alternatively, and more realistically, they see a risk that AI machines may simply not value or consider human life at all and unintentionally extinguish us in their single-minded pursuit of programmed tasks. Machines may find a creative solution that people did not anticipate and endanger us all.

Gary Smith convincingly presents his belief that neither of these views is correct. If achieving true AI is like landing on the moon, all of the impressive recent advances are more like tree-planting than rocket-building. New advancements are akin to adding branches to the tree, and getting us higher off the ground, but not on the path towards the moon.

Humanity has turned away from the exceedingly difficult task of trying to mimic the way the brain works and towards the easier applications (such as spell-checkers and search engines) that leverage what computers do well. These new applications are useful and profitable but, if the goal is for machines to be capable of understanding the world, we need to start over with a new approach to AI. Machines gaining human-like intelligence is not something around the corner unless we start building rockets. 

The AI Delusion warns us that the real danger of AI is not that computers are smarter than we are but that we think computers are smarter than we are. If people stop thinking critically and let machines make important decisions for them, like determining jail sentences or hiring job candidates, any one of us may soon become a victim of an arbitrary and unjustifiable conclusion. It is not that computers are not incredibly useful; they allow us to do in minutes what might take a lifetime without them. The point is that, while current AI is artificial, it is not intelligent.

The Illusion of Intelligence

Over the years I have learned a tremendous amount from Gary Smith’s books and his way of thinking. It seems like a strange compliment but he is deeply familiar with randomness. He knows how random variables cluster, how long streaks can be expected to continue, and what random walks look like. He can examine a seemingly interesting statistical fluke in the data and conclude “you would find that same pattern with random numbers!” and then prove it by running a simulation. He uses this tactic often in his books and it is extremely effective. How can you claim that a pattern is meaningful when he just created it out of thin air?

The AI Delusion begins with a painful example for the political left of the United States. Smith points a finger at the over-reliance on automated number-crunching for the epic failure of Hillary Clinton’s presidential campaign in 2016. Clinton had a secret weapon: a predictive modeling system. Based on historical data, the system recommended campaigning in Arizona in an attempt for a blowout victory while ignoring states that Democrats won in prior years. The signs were there that the plan needed adjusting: her narrow victory over Bernie Sanders, the enthusiastic crowds turning out for Trump, and the discontent of blue-collar voters who could no longer be taken for granted. However, since her computer system did not measure those things, they were considered unimportant. Clinton should have heeded the advice of sociologist William Bruce Cameron: “not everything that can be counted counts, and not everything that counts can be counted.” Blindly trusting machines to have the answers can have real consequences. When it comes to making predictions about the real world, machines have blind spots, and we need to watch for them.

In contrast, machines are spectacular at playing games; they can beat the best humans at practically every game there is. Games like chess were traditionally considered proxies for intelligence, so if computers can crush us, does that mean that they are intelligent? As Smith reviews various games, he shows that the perception that machines are smart is an illusion. Software developers take advantage of mind-boggling processing speed and storage capabilities to create programs that appear smart. They focus on a narrow task, in a purified environment of digital information, and accomplish it in a way that humans never would. Smith points out the truth behind the old joke that a computer can make a perfect chess move while it is in a room that is on fire; machines do not think, they just follow instructions. The fact that they’re good at some things does not mean they will be good at everything.

In the early days of AI, Douglas Hofstadter, the author of the incredibly ambitious book Gödel, Escher, Bach: An Eternal Golden Braid, tackled the seemingly impossible task of replicating the way a human mind works. He later expressed disappointment as he saw the development of AI take a detour and reach for the tops of trees rather than the moon:

To me, as a fledgling [artificial intelligence] person, it was self-evident that I did not want to get involved in that trickery. It was obvious: I don’t want to be involved in passing off some fancy program’s behavior for intelligence when I know that it has nothing to do with intelligence.

A New Test for AI

The traditional test for machine intelligence is the Turing Test. It essentially asks the question: “Can a computer program fool a human questioner into thinking it is a human?” Depending on the sophistication of the questioner, the freedom to ask anything at all can pose quite a challenge for a machine. For example, most programs would be stumped by the question “Would flugly make a good name for a perfume?” The problem with this test is that it is largely a game of deception. Pre-determined responses and tactics, such as intentionally making mistakes, can fool people without representing any useful advance in intelligence. You may stump Siri with the ‘flugly’ question today, but tomorrow the comedy writers at Apple may have a witty response ready: “Sure, flidiots would love it.” This would count as the trickery Hofstadter referred to. With enough training, a program will pass the test but it would not be due to anything resembling human intelligence; it would be the result of a database of responses and a clever programmer who anticipated the questions.

Consider Scrabble legend Nigel Richards. In May 2015, Richards, who does not speak French, memorized 386,000 French words. Nine weeks later he won the first of his two French-language Scrabble World Championships. This can provide insight into how computers do similarly amazing things without actually understanding anything. Another analogy is the thought experiment from John Searle in which someone in a locked room receives and passes back messages under the door in Chinese. The person in the room does not know any Chinese; she is just following computer code that was created to pass the Turing Test in Chinese. If we accept that the person in the room following the code does not understand the questions, how can we claim that a computer running the code does?

A tougher test to evaluate machine intelligence is the Winograd Schema Challenge. Consider what the word ‘it’ refers to in the following sentences:

I can’t cut that tree down with that axe; it is too thick.

I can’t cut that tree down with that axe; it is too small.

A human can easily determine that, in the first sentence, ‘it’ refers to the tree while, in the second, ‘it’ is the axe. Computers fail these types of tasks consistently because, like Nigel Richards, they do not know what words mean. They don’t know what a tree is, what an axe is, or what it means to cut something down. Oren Etzioni, a professor of computer science, asks “how can computers take over the world, if they don’t know what ‘it’ refers to in a sentence?”

One of my favorite surprises from the book is the introduction of a new test (called the Smith Test of course) for machine intelligence:

Collect 100 sets of data; for example, data on U.S. stock prices, unemployment, interest rates, rice prices, sales of blue paint in New Zealand, and temperatures in Curtin, Australia. Allow the computer to analyze the data in any way it wants, and then report the statistical relationships that it thinks might be useful for making predictions. The computer passes the Smith test if a human panel concurs that the relationships selected by the computer make sense.

This test highlights the two major problems with unleashing sophisticated statistical algorithms on data. One problem is that computers do not know what they have found; they do not know anything about the real world. The other problem is that it is easy, even with random data, to find associations. That means that, when given a lot of data, what computers find will almost certainly be meaningless. Without including a critical thinker in the loop, modern knowledge discovery tools may be nothing more than noise discovery tools.
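Here is a sketch of what a computer left alone with data will proudly report, with random numbers standing in for the 100 data sets:

```python
# In the spirit of the Smith test: 100 unrelated random series, all pairwise
# correlations computed, and the "best discoveries" are pure noise.
import numpy as np

rng = np.random.default_rng(4)
n_series, n_obs = 100, 50
data = rng.normal(size=(n_series, n_obs))

corr = np.corrcoef(data)                 # 100 x 100 correlation matrix
upper = np.triu_indices(n_series, k=1)   # each pair counted once
pairs = sorted(zip(np.abs(corr[upper]), upper[0], upper[1]), reverse=True)

print("strongest 'relationships' found in pure noise:")
for r, i, j in pairs[:3]:
    print(f"  series {i} vs series {j}: |correlation| = {r:.2f}")
```

Every one of those “relationships” is noise, and a human panel would say so immediately, which is exactly the point of the test.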

It is hard to imagine how a machine could use trickery to fake its way through a test like this. Countless examples in the book show that even humans who are not properly armed with a sense of skepticism can believe that senseless correlations have meaning:

  • Students who choose a second major have better grades on average. Does this mean a struggling student should add a second major?
  • Men who are married live longer than men who are divorced or single. Can men extend their lifespans by tying the knot?
  • Emergency room visits on holidays are more likely to end badly. Should you postpone emergency visits until the holidays are over?
  • Freeways with higher speed limits have fewer traffic fatalities. Should we raise speed limits?
  • Family tension is strongly correlated with hours spent watching television. Will everyone get along better if we ditch the TV?
  • People who take driver-training courses have more accidents than people who do not. Are those courses making people more reckless?
  • Students who take Latin courses score higher on verbal ability. Should everyone take Latin?

Many people incorrectly assume causal relationships in questions like these and unthinking machines would certainly do so as well. Confounding variables only become clear when a skeptical mind is put to use. Only after thinking carefully about what the data is telling us, and considering alternate reasons why there might be an association, can we come to reasonable conclusions.

Gary Smith’s specialty is teaching his readers how to spot nonsense. I’m reminded of a memorable speech from the movie My Cousin Vinny[2]:

Vinny: The D.A.’s got to build a case. Building a case is like building a house. Each piece of evidence is just another building block. He wants to make a brick bunker of a building. He wants to use serious, solid-looking bricks, like, like these, right? [puts his hand on the wall]
Bill: Right.
Vinny: Let me show you something.
[He holds up a playing card, with the face toward Billy]
Vinny: He’s going to show you the bricks. He’ll show you they got straight sides. He’ll show you how they got the right shape. He’ll show them to you in a very special way, so that they appear to have everything a brick should have. But there’s one thing he’s not gonna show you. [He turns the card, so that its edge is toward Billy]
Vinny: When you look at the bricks from the right angle, they’re as thin as this playing card. His whole case is an illusion, a magic trick…Nobody – I mean nobody – pulls the wool over the eyes of a Gambini.

Professor Smith endeavors to make Gambinis out of us all. After reading his books, you are taught to look at claims from the right angle and see for yourself if they are paper thin. In the case of The AI Delusion, the appearance of machine intelligence is the magic trick that is exposed. True AI would be a critical thinker with the capability to separate the meaningful from the spurious, the sensible from the senseless, and causation from correlation.

Data-Mining for Nonsense

The mindless ransacking of data in search of patterns and correlations, which is what AI does best, is at the heart of the replication crisis in science. Finding an association in a large dataset just means that you looked, nothing more. Professor Smith writes about a conversation he had with a social psychologist at Sci Foo 2015, an annual gathering of scientists and writers at the Googleplex. She expressed admiration for Daryl Bem, a social psychologist who openly endorsed blindly exploring data to find interesting patterns. Bem is known, not surprisingly, for outlandish claims that have been refuted by other researchers. She also praised Diederik Stapel, who has admitted that he made up data. Smith changed the subject. The following day, a prominent social psychologist said that his field is the poster child for irreproducible research and that his default assumption is that every new study is false. That sounds like a good bet. Unfortunately, adding more data and high-tech software that specializes in discovering patterns will make the problem worse, not better.

To support the idea that computer-driven analysis is trusted more than human-driven analysis, Smith recounts a story about an economist in 1981 who was being paid by the Reagan administration to develop a computer simulation that predicted that tax revenue would increase if tax rates were reduced. He was unsuccessful no matter how much the computer tortured the data. He approached Professor Smith for help and was not happy when Smith advised him to simply accept that reducing tax rates would reduce tax revenue (which is, in fact, what happened). The effort to find a way to get a computer program to provide the prediction is telling; even back in the 80s people considered computers to be authoritative. If the machine says it, it must be true.

Modern day computers can torture data like never before. A Dartmouth graduate student named Craig Bennett used an MRI machine to search for brain activity in a salmon as it was shown pictures and asked questions. The sophisticated statistical software identified some areas of activity! Did I mention that the fish was dead? Craig grabbed it from a local market. There were so many areas (voxels) being examined by the machine that it would inevitably find some false positives. This was the point of the study; people should be skeptical of findings that come from a search through piles of data. Craig published his research and won the Ig Nobel Prize, which is awarded each year to “honor achievements that first make people laugh, and then make them think.” The lesson for readers of The AI Delusion is that anyone can read the paper and chuckle at the absurdity of the idea that the brain of a dead fish would respond to photographs, but the most powerful and complex neural net in the world, given the same data, would not question it.
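The arithmetic behind the dead salmon is easy to reproduce. Here is a toy sketch with made-up voxel counts, not Bennett’s actual data:

```python
# A toy version of the dead-salmon problem: thousands of pure-noise "voxels"
# tested at p < 0.05 will always produce some "active" ones.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_voxels, n_scans = 8_000, 30
voxels = rng.normal(size=(n_voxels, n_scans))   # no signal anywhere

# One-sample t-test per voxel: does its mean "activity" differ from zero?
pvals = stats.ttest_1samp(voxels, popmean=0.0, axis=1).pvalue
print(f"'active' voxels in a dead fish: {(pvals < 0.05).sum()} of {n_voxels}",
      f"(about {0.05 * n_voxels:.0f} expected by chance alone)")
```

Roughly 5 percent of the voxels “light up” no matter what, which is why the missing ingredient is a skeptical human, not a more powerful pattern-finder.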

One of the biggest surprises in the book was the effective criticism of popular statistical procedures including stepwise regression, ridge regression, neural networks, and principal components analysis. Anyone under the illusion that these procedures will protect them against the downsides of data-mining is disabused of that notion. Professor Smith knows their histories and technical details intimately. Ridge regression, in particular, takes a beating as a “discredited” approach. Smith delivers the checkmate, in true Smithian style, by sending four equivalent representations of Milton Friedman’s model of consumer spending to a ridge regression specialist to analyze:

I did not tell him that the data were for equivalent equations. The flimsy foundation of ridge regression was confirmed in my mind by the fact that he did not ask me anything about the data he was analyzing. They were just numbers to be manipulated. He was just like a computer. Numbers are numbers. Who knows or cares what they represent? He estimated the models and returned four contradictory sets of ridge estimates.

Smith played a similar prank on a technical stock analyst. He sent fictional daily stock prices based on student coin flips to the analyst to see if it would be a good time to invest. The analyst never asked what companies the price history was from but became very excited about the opportunity to invest in a few of them. When Smith informed him that they were only coin flips, he was disappointed. He was not disappointed that his approach found false opportunities in noise but that he could not bet on his predictions. He was such a firm believer in his technical analysis that he actually believed he could predict future coin flips.

Automated stock-trading systems, similar to AI, are not concerned with real world companies; the buy and sell decisions are based entirely on transitory patterns in the price and the algorithms are tuned to the primarily meaningless noise of historical data. I wondered why, if stock trading systems are garbage, investment companies spend billions of dollars on trading centers as close to markets as possible. Smith explains this as well: they want to exploit tiny price discrepancies thousands of times per second or to front-run orders from investors and effectively pick-pocket them. This single-minded pursuit of a narrow goal without concern for the greater good is unfortunately also a feature of AI. The mindless world of high-frequency trading, both when it is profitable (exploiting others) and when it is not (making baseless predictions based on spurious patterns), serves as an apt warning about the future that awaits other industries if they automate their decision-making.

Gary Smith draws a clear distinction between post-hoc justification for patterns found rummaging through data and the formation of reasonable hypotheses that are then validated or refuted based on the evidence. The former is unreliable and potentially dangerous while the latter was the basis of the scientific revolution. AI is built, unfortunately, to maximize rummaging and minimize critical thinking. The good news is that this blind spot ensures that AI will not be replacing scientists in the workforce anytime soon.

There Are No Shortcuts

If you have read other books from Gary Smith, you know to expect many easy-to-follow examples that demonstrate his ideas. Physicist Richard Feynman once said “If you cannot explain something in simple terms, you don’t understand it.” Smith has many years of teaching experience and has developed a rare talent for boiling ideas down to their essence and communicating them in a way that anyone can understand.

Many of the concepts seem obvious once you have understood them. However, do not be fooled into believing they are self-evident. An abundance of costly failures has resulted from people who carelessly disregarded them. Consider the following pithy observations…

We think that patterns are unusual and therefore meaningful.

Patterns are inevitable in Big Data and therefore meaningless.

The bigger the data the more likely it is that a discovered pattern is meaningless.

You see at once the danger that Big Data presents for data-miners. No amount of statistical sophistication can separate out the spurious relationships from the meaningful ones. Even testing predictive models on fresh data just moves the problem of finding false associations one level further away. The scientific way is theory first and data later.

Even neural networks, the shining star of cutting-edge AI, are susceptible to being fooled by meaningless patterns. The hidden layers within them make the problem even worse, hiding the features the model relies on inside a black box that is practically impossible to scrutinize. They remind me of the witty response from a family cook asked by a child about dinner choices: “You have two choices: take it or leave it.”

The risk that the data used to train a neural net is biased in some unknown way is a common problem. Even the most sophisticated model in the world could latch onto an incidental feature, like the type of frame around a picture it is meant to categorize, and become completely lost when it is shown new pictures with different frames. Neural nets can also fall victim to adversarial attacks designed to derail them by tweaking small details that no thinking entity would consider important. Because of the hidden layers, the programmers may never figure out what went wrong.

A paper was published a couple days ago in which researchers acknowledged that the current approaches to AI have failed to come close to human cognition. Authors from DeepMind, as well as Google Brain, MIT, and the University of Edinburgh write that “many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches.”[3] They conclude that “a vast gap between human and machine intelligence remains, especially with respect to efficient, generalizable learning.”

The more we understand about how Artificial Intelligence currently works, the more we realize that ‘intelligence’ is a misnomer. Software developers and data scientists have freed themselves from the original goal of AI and have created impressive software capable of extracting data with lightning speed, combing through it and identifying patterns, and accomplishing tasks we never thought possible. In The AI Delusion, Gary Smith has revealed the mindless nature of these approaches and made the case that they will not be able to distinguish meaningful from meaningless any better than they can identify what ‘it’ refers to in a tricky sentence. Machines cannot think in any meaningful sense so we should certainly not let them think for us.


[1] Guizzo, Erico. “IBM’s Watson Jeopardy Computer Shuts Down Humans in Final Game.” IEEE Spectrum: Technology, Engineering, and Science News. February 17, 2011. Accessed November 05, 2018. https://spectrum.ieee.org/automaton/robotics/artificial-intelligence/ibm-watson-jeopardy-computer-shuts-down-humans.

[2] My Cousin Vinny. Directed by Jonathan Lynn. Produced by Dale Launer. By Dale Launer. Performed by Joe Pesci and Fred Gwynne

[3] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. “Relational inductive biases, deep learning, and graph networks.” arXiv preprint arXiv:1806.01261, 2018.