I'm sure it's only a matter of weeks (months at most) before every "news" publication in this country runs a story on the pitiful state of statistical rigor in research. Most of them will probably point to one of a handful of papers that pick apart the statistical methods used in published studies. None of them will point out the fundamental statistical misunderstanding that undergirds 99% of the medical literature: the p-value and what it means. The p-value is so misunderstood, and the misunderstanding so widespread, that my medical school class was taught the wrong definition.
Here's what the p-value is not: "the probability that the null hypothesis is true." I didn't pull this definition out of thin air to beat up on; it was the "correct" answer on a test I took that asked, "Which of these is the definition of a p-value?" Beyond that, it's what most people think a p-value is.
These discussions get bogged down in a lot of technical statistics jargon and nonsense, so I am going to try to put some Real Things into the p-value framework in order to explain what a p-value actually is. Bear with me; I will make it as painless as possible.
The first step in any experiment is to come up with a hypothesis, so here's one: "The NFL has been systematically rigging the Super Bowl's opening coin flip in favor of the NFC since SB XXXII." What does this hypothesis mean in experimental terms? It means that the NFC will win the coin flip more often than it ought to. This works well because we're dealing with a coin flip: we know that winning and losing ought each to occur 50% of the time. Since our hypothesis is that these coin flips deviate from expected behavior, we can use the expected behavior as our null hypothesis.
What has actually happened? The NFC has won the last 14 coin flips. The probability of that happening under the null hypothesis is a mere 0.006% (p = 0.00006). In this simple example the probability is the p-value; in most studies there are many statistical machinations that take place before you arrive at the p-value, but that's not really important. What matters is the meaning of that end value.
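If you want to check that number yourself, the arithmetic is simple; a quick sketch in Python:

```python
# Under the null hypothesis each flip is fair and independent,
# so fourteen NFC wins in a row has probability (1/2)^14.
p_value = 0.5 ** 14

print(f"p = {p_value:.5f}")    # p = 0.00006
print(f"or {p_value:.3%}")     # or 0.006%
```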
What have we determined regarding our null hypothesis and our p-value? We can say the experimental data (the history of coin flips) indicate there is only a 0.006% chance of observing this sequence of events if our null hypothesis (i.e. our model) is correct. An important note: the caveat at the end is almost universally omitted, but its logic is part and parcel of the p-value's calculation, so its effect persists whether it is mentioned or not. The p-value is calculated presuming that the null hypothesis is true; as a result, the p-value cannot simultaneously be the probability that the null hypothesis is false.
This distinction is critical, because the next step taken by most scientists is to transform our 0.006% chance of the data, given that the null hypothesis is true, into a 99.994% chance that the null hypothesis is false. This semantic voodoo even has a name: the fallacy of the transposed conditional.
The reason the data must be illogically contorted in this fashion is that the calculation underlying p-values cannot make the claims its users believe it can. Consider the results of our Super Bowl coin flip example; we are left with two possible conclusions: either there truly is a pro-NFC conspiracy regarding the opening coin flip, or we have merely witnessed a series of events that is statistically very unlikely to occur. Our p-value gives us no way to distinguish between the two.
Let me demonstrate this point with the converse of our Super Bowl example. On Twitter I asked for someone to give me a random series of 14 coin flip results, and caidid obliged. For this experiment our hypothesis is that I rigged the flips against caidid, and the null hypothesis is that the coin is fair. Using caidid's series, and a PHP script I wrote to generate a random series of coin flip results (its output: HTTTHHTTHTHHHH), we can calculate our results. Caidid got (only) six correct. If our null hypothesis is true, we ought to see exactly six correct answers 18.3% of the time, and the probability of caidid getting six or more correct is 78.8%.
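Those two percentages fall straight out of the binomial distribution. A minimal sketch (in Python rather than PHP, purely for illustration):

```python
from math import comb

# Under the null hypothesis (a fair coin), the number of correct calls
# out of 14 flips follows a binomial distribution with p = 0.5.
def prob_exactly(k, n=14):
    return comb(n, k) / 2 ** n

p_exactly_six = prob_exactly(6)                             # ~0.183
p_six_or_more = sum(prob_exactly(k) for k in range(6, 15))  # ~0.788

print(f"P(exactly 6 correct) = {p_exactly_six:.1%}")   # 18.3%
print(f"P(6 or more correct) = {p_six_or_more:.1%}")   # 78.8%
```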
Again we are left with a conundrum. Did we witness a relatively likely outcome, or was the game truly rigged? Science would say that since p > 0.05 we reject our test hypothesis and presume the null hypothesis to be true. In this case that ought to leave you uncomfortable. After all, you don't know that I didn't rig the results against caidid, and none of this data can exonerate me. The reason probabilities are unsatisfactory here is that some other knowledge is needed: am I the type of person who would invent results? How many times did I run the coin flip script before choosing a series? And so on.
To put it more starkly, let's look at the actual definition of a p-value: the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. This definition makes the limitations much clearer: one cannot say whether the null hypothesis is true based only on a probability that is calculated by assuming it is true.
Consider a final example. My dog cries constantly when he needs to be let out to go to the bathroom. When he does not need to go, he cries only 10% of the time. Take as the null hypothesis that the dog does not need to go to the bathroom; the probability of observing my dog crying, given that hypothesis, is 10% (p = 0.10). If you then actually do observe my dog crying, what is the likelihood that the null hypothesis is incorrect, i.e. that he really does need to be let out?
Again, it's impossible to say from that number alone, because we're missing critical pieces of information. Without knowing when he was last let out, and whether he went to the bathroom then, there is no way to decide whether the hypothesis is true or false.
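Bayes' theorem makes the dependence on that missing information explicit. A sketch with invented priors (the numbers below are illustrative assumptions, not measurements; P(cry | needs out) is taken as 1.0 since he cries "constantly"):

```python
def posterior_needs_out(prior_needs_out,
                        p_cry_given_needs=1.0,   # assumed: cries constantly when he needs out
                        p_cry_given_not=0.10):   # cries 10% of the time otherwise
    """P(needs out | crying), via Bayes' theorem."""
    p_cry = (p_cry_given_needs * prior_needs_out
             + p_cry_given_not * (1 - prior_needs_out))
    return p_cry_given_needs * prior_needs_out / p_cry

# The same observation (crying) supports very different conclusions
# depending on the prior -- e.g. how long ago he was last let out.
for prior in (0.05, 0.30, 0.80):
    print(f"prior {prior:.0%} -> posterior {posterior_needs_out(prior):.0%}")
```

With a 5% prior the crying leaves you at roughly a one-in-three chance he needs out; with an 80% prior you are nearly certain. The p-value framework has no slot for that prior at all.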
What's sad about this is that all doctors are familiar with the Bayesian school of probability, which is equipped to make the claims that frequentist p-values are not. They have to be, because it is the principle underlying every laboratory test result. After all, we need some real-world data to justify the expenses and side effects of screening and diagnostic tests... Oh wait...