COVID-19 “Game Changer”? Test it, but don’t bet on it.

According to a WSJ article, doctors in France, South Korea, and the U.S. are using hydroxychloroquine to treat COVID-19 “with success,” and it argues that we don’t have time to wait for confirmation. It refers to this study, stating “…researchers in France treated a small number of patients with both hydroxychloroquine and a Z-Pak, and 100% of them were cured by day six of treatment. Compare that with 57.1% of patients treated with hydroxychloroquine alone, and 12.5% of patients who received neither.” That sounds incredibly promising, and while the article does mention the potential downsides of a hydroxychloroquine shortage and of peddling false hope, it clearly recommends using the treatment now rather than waiting.

A shortage of the drug may not seem like a big downside until you realize that it’s already being used by people (including my aunt) for conditions that it has been clinically proven to treat. (Update: it looks like Israel has got my aunt covered.) As for the French study, unfortunately, if you look at it closely, it is practically a poster child for the kind of research that can’t be replicated. There’s no randomized control. The number of treated patients is small (only twenty). And it comes from a hot field of study.

To see why these things matter, you may want to read one of the most downloaded papers in history, Dr. John Ioannidis’s “Why Most Published Research Findings Are False.” It’s popular because it addresses the replication crisis in science head-on and provides warning signs to watch out for before deciding to take new research seriously. In short, they are…

Small sample sizes. Have you ever noticed that if you rank cities by crime rate or cancer rate, the ones at the top and the bottom of the list always have very small populations? That’s because it’s much easier for small sets of data to show extreme results. Just as you wouldn’t be convinced by someone claiming to be the best shooter in basketball because they made 100% of their shots (how many did they take?), your first response to the report that 100% of patients were cured of COVID-19 after 6 days with a combination of Z-Pak and hydroxychloroquine shouldn’t be “there’s a cure!” It should be “wow, not many patients in that study.” If a study has a surprising result and a small sample size, be skeptical.
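
Here’s a minimal simulation of that point. The numbers are made up for illustration (a useless drug against a disease with a 40% natural recovery rate), not taken from the study:

```python
# Small samples produce extreme results by luck alone.
# Assumption: a useless drug, 40% natural recovery rate (made-up numbers).
import random

random.seed(0)

def observed_cure_rate(n_patients, true_rate=0.40):
    """Run one fake trial and return the fraction who happened to recover."""
    return sum(random.random() < true_rate for _ in range(n_patients)) / n_patients

for n in (6, 20, 1000):
    rates = [observed_cure_rate(n) for _ in range(2_000)]
    print(f"n = {n:4d}: observed cure rates ranged from {min(rates):.0%} to {max(rates):.0%}")
```

With six patients, chance alone routinely produces “cure rates” anywhere from 0% to 100%; with a thousand, the results stay near the true 40%. The small cities top both ends of those city rankings for the same reason.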

Small effect sizes. Effect sizes tend to be exaggerated, so if a small effect size is being reported, there’s a decent chance the real effect size is zero. In the case of the French study, the effect size is huge, perhaps suspiciously so. Only 2 of the 16 patients in the control group recovered in 6 days, while 14 of the 20 in the treatment group did. This seems overwhelmingly convincing until you see that the control group was not randomly chosen and was extremely different from the treatment group (its patients were, on average, 14 years younger, so who knows how different they were on important features like how far the infection had progressed on day one).
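
Here’s a sketch of how that exaggeration happens (sometimes called the “winner’s curse”). All the numbers are hypothetical: a drug that truly raises recovery from 40% to 45%, tested in small trials where only striking results get written up:

```python
# Why reported effect sizes tend to be exaggerated.
# Assumptions (all made up): true effect is 5 percentage points,
# 20 patients per arm, and only results showing a 25+ point gap get noticed.
import random
import statistics

random.seed(1)

def observed_effect(n=20, p_control=0.40, p_treated=0.45):
    control = sum(random.random() < p_control for _ in range(n))
    treated = sum(random.random() < p_treated for _ in range(n))
    return (treated - control) / n  # difference in recovery rates

effects = [observed_effect() for _ in range(100_000)]
published = [e for e in effects if e >= 0.25]  # stand-in for significance filtering

print(f"average observed effect:  {statistics.mean(effects):+.1%}")
print(f"average published effect: {statistics.mean(published):+.1%}")
```

The filtering turns a true 5-point effect into a headline effect of roughly 30 points. A reported effect that is small to begin with may therefore be nothing at all.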

Also, even if we assume that the treatment and control groups were comparable, the claimed p-value of 0.001 is unlikely to be accurate. It’s calculated correctly, given the difference between the control group (12.5% recovering) and the treated patients (70% recovering), but it ignores the fact that the treatment group had some patients removed from the analysis: “Six hydroxychloroquine-treated patients were lost in the follow-up during the survey because of early cessation of treatment. Reasons are as follows: three patients were transferred to intensive care unit…one patient stopped the treatment…because of nausea…one patient died…” Wait, what? Can we not assume that patients who went to the ICU or died would still have the virus if they were tested? Instead of just testing whether or not the drug removes the virus from patients still kicking, shouldn’t the question be whether or not the drug helps patients leave the hospital alive when all is said and done? There was also a patient counted as virus-free who was found to have the virus two days later (presumably due to a false negative test). To be fair, even when I add back in the five patients I think should not have been removed from the treatment group, the p-value is still 0.01, but the filtering out of patients with clearly bad outcomes from the treatment group is not comforting.
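
If you want to check that arithmetic yourself, here’s a sketch. I’m using Fisher’s exact test as a stand-in (not necessarily the paper’s own test), and counting the five restored patients as still infected, which is my reading of the numbers above:

```python
# Recomputing the p-value with and without the excluded treatment patients.
from scipy.stats import fisher_exact

#              virus-free  still positive
as_reported = [[14, 6],    # treated: 14 of 20
               [2, 14]]    # control:  2 of 16

restored    = [[14, 11],   # treated: 14 of 25, with the 5 exclusions added
               [2, 14]]    # back as treatment failures (my assumption)

for label, table in (("as reported", as_reported), ("exclusions restored", restored)):
    _, p = fisher_exact(table)
    print(f"{label}: p = {p:.3f}")
```

The advantage survives (p ≈ 0.01 instead of ≈ 0.001), which is the point: the result doesn’t vanish, but a significance test can’t tell you that the worst outcomes were filtered out of one group.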

Many tested relationships. If there are many failures that are disregarded, you can be pretty sure that the successes occurred due to chance alone. Fortunately, the French study doesn’t appear to be doing this. The researchers mention that the combination of hydroxychloroquine and Z-Pak had a 100% success rate, but they don’t hide the fact that it was tried on a total of only six patients. The overall focus remains on the big picture of hydroxychloroquine vs. COVID-19 as originally intended.

Flexibility in analysis. This one is very similar to the last, because it has to do with trying many different analyses before coming to a conclusion. Economist Ronald Coase summed this up nicely with the saying: “If you torture data long enough, it will confess.” There does seem to be a bit of data torture going on in the French study. The instances were described well and enumerated in an email from a friend, Dr. Adam Chapweske, yesterday (a simulation after the list shows how much this kind of flexibility can matter)…

  1. Any patient whose illness becomes severe during the trial is excluded from final analysis. So it seems that a study designed to determine viral clearance explicitly excludes anyone who ends up having a really high viral load at some point in the study (assuming severe illness indicates high viral load). This ended up excluding 4 patients receiving hydroxychloroquine from the final analysis (in addition to two others lost to follow up for other reasons) and none from the control group.
  2. Their registered secondary endpoints include clinical outcomes, but they don’t include these in their reported endpoints. As mentioned above, several patients receiving hydroxychloroquine either died or required transfer to an ICU, whereas no patients in the control group did. This makes me wonder about the other clinical data they originally planned on reporting but ad hoc decided not to mention. It’s particularly concerning since the authors themselves make very strong clinical recommendations based on their study.
  3. Best I can tell, their decision to report on results early (i.e., prior to completing enrollment or completing their registered primary endpoint) was also ad hoc.
  4. Their registered design does not mention azithromycin, yet they report on outcomes involving that drug and even include it in the title of their paper and in their results. Given they were not actually studying azithromycin, it would have been fine to mention the effect in the discussion section as a possible intervention for future study, but they shouldn’t give the impression that it was in some meaningful sense “studied”.
  5. The primary endpoint was virological clearance, but the baseline viral load for the control group is not given so we don’t know if the two groups are comparable with respect to this parameter. This is especially important in light of the small sample size, differences in disease (upper respiratory tract infection vs lower respiratory tract infection) and demographic and geographical differences between the two groups. 
  6. Virological measurements were reported differently for the two groups as well, which suggests to me that there were differences in the way they were tested. 
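
To put a rough number on how much this kind of flexibility can matter, here’s the simulation promised above. Every choice in it is a hypothetical stand-in (loosely echoing the exclusions and early reporting in the list), applied to a trial where the drug truly does nothing:

```python
# Flexible analysis inflates false positives even when the drug is useless.
# Assumptions (made up): 20 patients per arm, 40% recovery in both arms.
import random
from scipy.stats import fisher_exact

random.seed(2)

def p_value(treated, control):
    table = [[sum(treated), len(treated) - sum(treated)],
             [sum(control), len(control) - sum(control)]]
    return fisher_exact(table)[1]

n_sims, false_positives = 2_000, 0
for _ in range(n_sims):
    treated = [random.random() < 0.40 for _ in range(20)]
    control = [random.random() < 0.40 for _ in range(20)]

    # One analysis quietly drops (up to) three treated failures
    # as "lost to follow-up."
    failures = [x for x in treated if not x]
    trimmed = [x for x in treated if x] + failures[3:]

    analyses = [p_value(treated, control),            # honest analysis
                p_value(trimmed, control),            # exclude bad outcomes
                p_value(treated[:10], control[:10])]  # report early
    if min(analyses) < 0.05:  # report whichever analysis looks best
        false_positives += 1

print(f"false-positive rate: {false_positives / n_sims:.0%}")
```

A single honest test is wrong about 5% of the time; shopping among several analyses of the same data pushes that well above 5%, and the outcome-based exclusion is the worst offender.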

Financial incentives. We all know to watch out for this one. Financial incentives are incredibly powerful and unfortunately, many people value money more than their scientific integrity. I didn’t see any reason to suspect that the researchers would benefit personally by promoting their recommended drug. They’re just reporting what they truly believe is a promising result.

And the last one: A hot field of study. If the field is hot, there is a race to publish, and with enough researchers trying different things, it’s almost certain that someone will find something somewhere that is statistically significant. This is like collective p-hacking. Rather than one researcher trying many things, you have many researchers trying one thing, and the result is the same: unreliable results. Studying the effect of drugs on COVID-19 is clearly a hot field of study. So prepare yourselves for several false positives, even from more scientifically rigorous studies than this, before we find an effective treatment. In the meantime, keep experimenting. And I’m begging you: please use a randomized control.
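
The arithmetic behind collective p-hacking is simple. With made-up but plausible numbers:

```python
# If 50 teams each run one honest trial of a useless drug at p < 0.05,
# how likely is it that somebody, somewhere reports a "significant" cure?
teams, alpha = 50, 0.05

p_at_least_one = 1 - (1 - alpha) ** teams
print(f"chance of at least one false 'cure': {p_at_least_one:.0%}")  # ~92%
```

No individual team has done anything wrong, but the field as a whole is nearly guaranteed to produce a false positive, and the false positive is the result you’ll hear about.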

Update (3/23/2020): WHO announced a megatrial that will test the four most promising treatments for COVID-19. The WHO scientific panel originally wasn’t going to include the “game changer” drug, but decided to test it due to the attention it’s getting. According to Susanne Herold, an expert on pulmonary infections, “Researchers have tried this drug on virus after virus, and it never works out in humans. The dose needed is just too high.” Even though it doesn’t seem likely to work, I am happy to see it included in the megatrial. Now that the rumors are out there and people are scrambling for the drug, some are inevitably going to find out the hard way that hydroxychloroquine might do more harm than good. It’s better to give it a rigorous test and provide people with solid answers rather than speculation.


Author: Jay Cordes

Jay Cordes is a data scientist and co-author of "The Phantom Pattern Problem" and the award-winning book "The 9 Pitfalls of Data Science" with Gary Smith. He earned a degree in Mathematics from Pomona College and more recently received a Master of Information and Data Science (MIDS) degree from UC Berkeley. Jay hopes to improve the public's ability to distinguish truth from nonsense and to guide future data scientists away from the common pitfalls he saw in the corporate world. Check out his website at jaycordes.com or email him at jjcordes (at) ca.rr.com.