That's the thing. P-values don't prove that anything must be true. They simply say that if you reran the experiment, it would be surprising to get a different result. Conversely, if you don't find "statistical significance," it definitely doesn't mean there isn't a difference. In practice, it may (and often does) mean the study didn't have enough samples to detect a relatively small effect, but the layperson making decisions (do I allow right turn on red, or is that dangerous?) may not get that nuance. A book that really helped clarify my thinking on this is _Statistics Done Wrong_ by Alex Reinhart.
Edit: remove "interpret" from last sentence to clarify
Yes, this immediately reminded me of that book. To add an example (from the book): after turn-on-red was implemented in a few places, a study was conducted to check whether it increased the rate of accidents. The study found no statistically significant effect, so turn-on-red was rolled out statewide. Unfortunately, the study failed to detect an effect not because there was no effect, but because the sample was too small for the effect to reach statistical significance. Now that we have more data, it turns out that turn-on-red does increase the rate of accidents.
> They simply say that if you reran the experiment, it would be surprising to get a different result.
Not really. A low p-value says that it was surprising to get the result that you got, assuming that the null hypothesis is true. And if the null hypothesis is true, it would be surprising to get the same result again (i.e. a result as extreme). If the null hypothesis is not true, the result would not be so surprising (or maybe more surprising, if the true effect is in the "wrong" direction).
The result we got gives some evidence for the null hypothesis being false, but if the null hypothesis was very likely to be true beforehand, it may still be very likely to be true afterwards. In that case it wouldn't be surprising to get a different result if the experiment were performed again.
Illustration: I roll a die three times. I get three ones. P<0.01 (for the null hypothesis of a fair die and the two-tailed test on the average). This is not simply saying that if I roll the die three times again it would be surprising to get something other than ones.
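To put numbers on the previous point, here is a minimal sketch in Python. The prior (1 in 1000 dice is loaded to always roll ones) and the loaded-die alternative are made up purely for illustration:

```python
from fractions import Fraction

# Hypothetical prior: 999 in 1000 dice are fair, 1 in 1000 always rolls ones.
prior_fair = Fraction(999, 1000)
prior_loaded = Fraction(1, 1000)

# Likelihood of observing three ones under each hypothesis.
lik_fair = Fraction(1, 6) ** 3  # 1/216
lik_loaded = Fraction(1, 1)     # the loaded die shows three ones with certainty

# Bayes' rule: posterior probability that the die is fair (the null).
posterior_fair = (prior_fair * lik_fair) / (
    prior_fair * lik_fair + prior_loaded * lik_loaded
)
print(float(posterior_fair))  # ~0.82: the null is still likely despite p < 0.01
```

So even with p < 0.01, under this (made-up) prior the fair-die hypothesis remains the better bet, and three more ones on a re-roll would still be surprising.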
> I roll a die three times. I get three ones. P<0.01 (for the null hypothesis of a fair die and the two-tailed test on the average).
Hmm. At a glance, that doesn't seem right. Yes, the chance of rolling three 1s is 1/(6^3), but if we only rolled once and got a single 1, we wouldn't have any reason to suspect that the die was unfair. So maybe we should only consider the second two repetitions, and conclude with p ~ .03 that the die is unfair? Otherwise, consider the case that we rolled a 1, 5, 2: certainly we shouldn't use this series of non-repeated outcomes as p < .01 evidence of an unfair die?
If the die is fair, the average score will be 3.5. One can define a test based on that value and reject the null hypothesis when the average score is too low or too high.
The sampling distribution for the average can be calculated and for three rolls the extreme values are 1 (three ones) and 6 (three sixes) which happen with probability 1/216 each. Getting three ones or three sixes is then a p=0.0093 result.
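This can be checked by brute force. A minimal sketch in Python, enumerating all 216 equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# All 6^3 = 216 equally likely outcomes of three fair-die rolls,
# i.e. the exact sampling distribution of the average score.
averages = [sum(rolls) / 3 for rolls in product(range(1, 7), repeat=3)]

observed = 1.0  # average score of three ones

# Two-tailed p-value: probability of an average at least as far
# from the null expectation of 3.5 as the observed one.
extreme = sum(abs(a - 3.5) >= abs(observed - 3.5) for a in averages)
p = Fraction(extreme, len(averages))
print(p, float(p))  # 1/108 ≈ 0.0093 (only three ones or three sixes qualify)
```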
You raise a valid point. This is clearly not the best test for detecting unfair dice, because for a die which has only two equally probable values, 3 and 4, we would reject the null hypothesis even less often than for a fair die! (In that case, the power would be below alpha, which is obviously pretty bad.)
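For that 3-and-4-only die, the rejection region of the test above (average <= 1 or average >= 6) is unreachable, which a short enumeration confirms:

```python
from itertools import product

# A die that shows only 3 or 4: its average over three rolls always lies
# between 3 and 4, so it can never land in the rejection region
# {average <= 1 or average >= 6} of the test described above.
rejections = sum(
    sum(rolls) / 3 <= 1 or sum(rolls) / 3 >= 6
    for rolls in product((3, 4), repeat=3)
)

power = rejections / 2 ** 3
print(power)  # 0.0, below alpha ≈ 0.0093: this unfair die is never flagged
```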
That's not what OP is saying though. Let's say p = .001. That means we're confident the results are not due to chance.
If we repeat the experiment and get a different result, then we need to be looking into confounding variables and testing methodology. Just because the p-value is low doesn't mean there's no fundamental flaw in the experiment used to find it.
> how likely this result would be if the null hypothesis were true
It's how likely the result is to occur by chance if the null hypothesis is true. A positive result can occur for lots of other reasons even if the null hypothesis is true, and the p-value doesn't tell you anything about how likely you are to get a certain result if the null hypothesis is true (or false).
> It's how likely the result is to occur by chance if the null hypothesis is true. A positive result can occur for lots of other reasons even if the null hypothesis is true, and the p-value doesn't tell you anything about how likely you are to get a certain result if the null hypothesis is true (or false).
I don’t think your comment makes sense.
Given a parametric model and a particular value of the parameter (i.e. the null hypothesis), one can calculate the sampling distribution of the data.
Therefore, under the null hypothesis, the model gives a well-defined probability distribution for the data, and you can tell how likely you are to get a certain result.
When we say “how likely you are to get a certain result if the null hypothesis is true”, one should understand “the null hypothesis is true” as “the data is generated by a process perfectly described by the model, including a particular value for the parameter”.
I agree that if the data is made up, the results of the study and the statistical analysis based on those results will have no relation whatsoever to whether the null hypothesis was true.
The p-value tells you just how likely you are to get a certain result (or a more extreme one) if the data-generating model is indeed correct and the null hypothesis is true.
We agree that the p-value doesn't tell you anything about how likely it is that the study was based on fabricated data, how likely it is that the model is correct, or how likely it is that the null hypothesis is true.
The p-value doesn’t tell us anything about the real world. It’s a probability conditional on a hypothetical model.
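To make "conditional on a hypothetical model" concrete, here is a minimal sketch with a made-up example: a binomial model for 10 coin flips, the null hypothesis being a fair coin, and 9 heads observed:

```python
from math import comb

# Hypothetical model: 10 flips of a coin, null hypothesis P(heads) = 1/2.
n, observed = 10, 9

def prob(k):
    # Binomial probability of exactly k heads under the null.
    return comb(n, k) / 2 ** n

# Two-tailed p-value: probability, computed entirely inside the model,
# of a result at least as far from the expected 5 heads as the observed 9.
p = sum(prob(k) for k in range(n + 1) if abs(k - n / 2) >= abs(observed - n / 2))
print(p)  # ≈ 0.021
```

The 0.021 lives entirely inside the assumed model; it says nothing about whether the coin, the flips, or the data were real.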