I truly believe that Bayesian inference is the statistics of the 21st century. Recent advances in MCMC (e.g., NUTS, Stan [1]) and variational inference (e.g., ADVI [2], VAE [3]), plus more computing power than ever, promise a near future in which Bayesian inference is the default inference engine.
The prior distribution is a beautiful and logical mechanism for adding regularization and domain-specific knowledge to our model.
But note that the last paragraph of TFA cautions against using a Bayesian prior in cases where it is not well supported by actual hard prior information.
It is very hard to validate a given choice of a prior in many applications. E.g., if I claim one prior, and another investigator claims a sharper one, it can be very difficult to decide who is right.
If the prior does not wash out due to lots of data, this indicates a serious and fundamental problem.
> It is very hard to validate a given choice of a prior in many applications. E.g., if I claim one prior, and another investigator claims a sharper one, it can be very difficult to decide who is right.
Both the prior and the likelihood are our model's assumptions. So the prior validation problem is similar to the likelihood validation problem: to check a Bayesian model, or any model, we need to bring it out of the formal world and into the real world for validation.
Prior predictive simulation, which generates random data sets from the prior, is a good heuristic for showing that a prior is NOT plausible.
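To make that concrete, here's a minimal numpy sketch. The Normal(8, 3) prior on mean wind speed and the Gaussian likelihood are made-up numbers for illustration, borrowing the windmill example from elsewhere in the thread:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical prior over the mean wind speed (m/s): Normal(8, 3).
prior_mu = rng.normal(8.0, 3.0, size=1000)

# For each prior draw, simulate one day of hourly wind-speed readings
# from the likelihood (here, Normal around the drawn mean, sd = 1).
simulated = rng.normal(prior_mu[:, None], 1.0, size=(1000, 24))

# Sanity check: does the prior generate physically impossible data?
frac_negative = (simulated < 0).mean()
print(f"fraction of negative simulated speeds: {frac_negative:.3f}")
# If this fraction were large, the prior would put too much mass on
# impossible values and should be revised before seeing any data.
```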
> ...the prior validation problem is similar to the likelihood validation problem...
But priors can be much harder.
Say I’m trying to estimate a wind speed from the blade velocity of a windmill. I can bring a more accurate wind speed sensor to calibrate the windmill against the wind speed, perhaps aided by basic physics. This is the likelihood portion.
But what should the prior be? The typical speed at that time of day? The speed in January? The speed on cloudy days? I have to have a crisp number — a full distribution actually, accurate out to the tails. I really have very little grounding for choosing that distribution.
I started out just wanting to relate the wind speed to some data in a rather concrete way, and now I’ve been roped in to choosing a crisp distribution for a rather amorphous state of nature.
This is a deep problem.
We can sharpen the problem. Say my number and yours are different. How do we tell who is right?
One can try a different tack: I’m being stubborn. The prior will mostly wash out in any well-posed problem, or else why try to solve it? But now we’re back to frequentism, just looking at the likelihood.
HN tends to invoke the Bayesian framework as a complete solution to inference — I’m just trying to demonstrate that there are problems with that approach.
> I can bring a more accurate wind speed sensor to calibrate the windmill against the wind [...] But what should the prior be? [...] I have to have a crisp number — a full distribution actually, accurate out to the tails.
What would you do when the sensor returns negative wind speeds due to noise or errors?
The wind speed cannot be negative, or greater than the speed of light. An expert in windmills can narrow down the prior distribution much further.
> But priors can be much harder.
Choosing a prior is hard because it requires thinking explicitly about the problem and its assumptions.
It merely exposes our lack of expertise on the problem.
When you're lazy, you can pick a uniform prior Uniform(0, c) and call it a day.
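For illustration, here's a rough numpy sketch comparing the lazy Uniform(0, c) prior against an expert's prior truncated to physically possible speeds. The Normal(8, 3) "expert" prior and the 30 m/s cutoff are made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_normal(mu, sigma, low, high, n, rng):
    """Rejection-sample a Normal(mu, sigma) truncated to [low, high]."""
    out = np.empty(0)
    while out.size < n:
        draws = rng.normal(mu, sigma, size=n)
        out = np.concatenate([out, draws[(draws >= low) & (draws <= high)]])
    return out[:n]

# The lazy prior: Uniform(0, c) with c = 30 m/s.
lazy_prior = rng.uniform(0.0, 30.0, size=10_000)

# The expert's prior: Normal(8, 3), truncated to physically possible speeds.
expert_prior = sample_truncated_normal(8.0, 3.0, 0.0, 30.0, 10_000, rng)

# The expert prior concentrates its mass in a much narrower range.
print(lazy_prior.std(), expert_prior.std())
```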
> We can sharpen the problem. Say my number and yours are different. How do we tell who is right?
Forget about the prior: say we have two sensors that output two slightly different wind speeds. Which speed is right?
The lower one, or the average?
This is a deep philosophical problem. However, it's a problem for any model.
> The prior will mostly wash out in any well-posed problem.
I don't think so. Any well-posed problem should include the prior; without it, how can we tell that, say, two data points are not enough?
> HN tends to invoke the Bayesian framework as a complete solution to inference [...]
The Bayesian framework is indeed a complete solution to inference in a formal/logical sense.
However, I agree that applying the Bayesian framework to real-world problems raises many issues that require serious thinking about our assumptions about the problem.
"Bayesian framework is indeed a complete solution to inference in a formal/logical sense."
Bradley Efron, in TFA, begs to differ:
"I wish I could report that this resolves the 250-year controversy and that it is now safe to always employ Bayes’ theorem. Sorry. My own practice is to use Bayesian analysis in the presence of genuine prior information; to use empirical Bayes methods in the parallel cases situation; and otherwise to be cautious when invoking uninformative priors. In the last case, Bayesian calculations cannot be uncritically accepted and should be checked by other methods, which usually means frequentistically."
My perspective is that the problem of deciding the "correct" prior is a human problem because the human brain is a messy machine. An artificial intelligence which has full access to its own code and its memory in perfect detail will know precisely what it knows about a certain situation, and therefore can estimate a prior that accurately reflects this knowledge.
In the windmill example, the AI can quickly collect all it has in its memory about blade speeds, and maybe spend a self-imposed X min computational time to make a best guess for the prior speed distribution.
Humans can't do this, so we have gone down a philosophical rabbit hole of figuring out this "prior problem", when the real problem is that we are just messy informal thinkers.
> How do we tell who is right?
You are fundamentally, conceptually mistaken here. There is nothing right or wrong about two agents disagreeing on the prior. The different priors reflect the two agents' knowledge before the experiment. I am a windmill engineer, so my priors will be much narrower than those of someone who has never seen a windmill outside of a Hollywood movie.
I'm not quite the expert that perhaps you are, but to me it seems like Bayesian inference is still in a better spot here, because the priors are part of an explicit quantification of the bias and assumptions in a model.
Much havoc has befallen the scientific world because of the hidden assumptions of frequentist techniques with poorly understood preconditions, even for rather basic models. And there isn't much anyone can do about that save move to ever more complicated models.
Do you know _why_ KL divergence is minimized? I get that it gives a lower bound on the marginal likelihood, which is cool, but is that it? What are the alternatives?
KL divergence is motivated nicely from an information/coding theory viewpoint. It's very closely related to Shannon entropy [1], and KL(P||Q) measures the inefficiency, in expected extra bits, of encoding data that actually come from P using a code that was designed for the model distribution Q.
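A toy illustration of that coding-theory reading, using made-up discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits, for discrete distributions p and q.

    This is the expected number of extra bits needed to encode draws
    from P using a code that was optimized for Q instead of P.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = np.array([0.5, 0.3, 0.2])   # "reality"
q = np.array([1/3, 1/3, 1/3])   # model: a uniform code

print(kl_divergence(p, p))  # 0.0: a code matched to P wastes nothing
print(kl_divergence(p, q))  # > 0, and generally != KL(Q || P): it's asymmetric
```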
A lot of recent work focuses on the Wasserstein distance [1] as an alternative. One advantage of Wasserstein over KL is that the Wasserstein metric encourages a good fit over the whole distribution instead of concentrating on a few specific regions, which helps prevent "mode collapse". This makes it a popular objective for training Generative Adversarial Networks (GANs).
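In one dimension the Wasserstein-1 distance between two equal-size samples reduces to matching sorted samples, so it's easy to sketch (a toy example with made-up Gaussian samples, not how WGANs compute it in practice):

```python
import numpy as np

def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan simply matches sorted samples,
    so W1 reduces to a mean absolute difference after sorting.
    """
    x, y = np.sort(x), np.sort(y)
    return float(np.abs(x - y).mean())

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.normal(3.0, 1.0, size=5000)

# Shifting a distribution by 3 moves it a Wasserstein distance of about 3,
# even where the densities barely overlap -- whereas KL blows up on
# non-overlapping support. That's one intuition for its use in GANs.
print(wasserstein_1d(a, b))
```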
Many samplers are based on HMC; it's a general class of samplers, not a specific algorithm. NUTS is a variation of HMC, as is the paper I linked above.
'Vanilla' HMC resamples the momentum and applies a Metropolis correction at each iteration to guarantee that the stationary distribution of the chain is the one you want, which can cause the process to behave like a random walk across trajectories. So although the Hamiltonian bit of HMC lets you take these great big steps through state space, you end up retracing your steps quite a lot.
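For intuition, here's a minimal toy HMC sampler: standard normal target, hand-picked step size and trajectory length, leapfrog integration, and the Metropolis accept/reject step. A sketch for illustration, not a production sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):          # log density of the standard normal target (up to a constant)
    return -0.5 * x * x

def grad_log_p(x):
    return -x

def hmc_step(x, step_size=0.2, n_leapfrog=20):
    """One 'vanilla' HMC transition: fresh momentum, leapfrog dynamics,
    then a Metropolis correction for the integration error."""
    p = rng.normal()                                 # resample momentum
    x_new, p_new = x, p
    p_new += 0.5 * step_size * grad_log_p(x_new)     # initial half step
    for _ in range(n_leapfrog):
        x_new += step_size * p_new                   # full position step
        p_new += step_size * grad_log_p(x_new)       # full momentum step
    p_new -= 0.5 * step_size * grad_log_p(x_new)     # undo the extra half step
    log_accept = (log_p(x_new) - 0.5 * p_new**2) - (log_p(x) - 0.5 * p**2)
    return x_new if np.log(rng.uniform()) < log_accept else x

samples = []
x = 0.0
for _ in range(5000):
    x = hmc_step(x)
    samples.append(x)
samples = np.array(samples)
print(samples.mean(), samples.std())  # should be close to 0 and 1
```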
I strongly disagree. Bayesian inference is the only known self-consistent formal system for doing science, i.e. updating our belief system about the world based on the current evidence.
You are not supposed to agree on a prior. That's one of the fundamental insights of the Bayesian inference framework: different people know different things about a given situation, so they initially disagree, and therefore their priors are different. This should not be surprising. People disagree all the time, and the Bayesian framework just formalizes it.
These different people can then go on to run lots of experiments, collect lots of data, and update their priors to posteriors. And the guarantee is that, as long as each person's prior was not a mathematically pathological function, after enough evidence has been collected all these people will have the same posterior, i.e., they will agree [1].
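A quick numerical sketch of that guarantee, with a made-up Beta-Bernoulli example: two agents start with opposite Beta priors on a coin's bias, and after 10,000 flips their posterior means essentially coincide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two agents with very different Beta priors on the coin's bias.
priors = {"skeptic": (1.0, 9.0), "optimist": (9.0, 1.0)}

# The shared evidence: 10,000 flips of a coin with true bias 0.7.
flips = rng.uniform(size=10_000) < 0.7
heads, tails = int(flips.sum()), int((~flips).sum())

# Conjugate update: Beta(a, b) prior -> Beta(a + heads, b + tails) posterior.
post_means = {}
for name, (a, b) in priors.items():
    post_means[name] = (a + heads) / (a + b + heads + tails)
    print(f"{name}: posterior mean = {post_means[name]:.4f}")
# Despite starting far apart, both posterior means land near 0.7:
# with enough data, reasonable priors wash out.
```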
Exactly. This is why (in my view) scientific research should focus on presenting evidence, not on arguing for certain posteriors or priors. The meta-science process then steers Bayesian beliefs correctly and the evidence-gathering process efficiently. (edit: I see now the top post here on this article also discusses this point.)
Actually, it's the strength of Bayesian inference that these assumptions are made apparent.
Coming to consensus on priors is the same process for arriving at consensus that all scientific inquiry must engage in. Anyone who says frequentist methods somehow more accurately represent an underlying reality is pulling a fast one.
Hmm, I think part of the question is where this debate and consensus should occur. I believe in firmly separating rigorous science from opinion and belief. To me, it follows that scientific research should focus on presenting evidence and leave it to Bayesian individuals to update their beliefs based on this evidence. Similarly I think argument or discussion about priors is not in scope for scientific research (except maybe a bit in the "motivation" subsection). (edit: I see now the top post here on this article also discusses this point.)
I can't help but feel like this is a fundamental definition problem. Science is not actually distinct from consensus forming. Science does not work with raw facts, it forms models based off human observations which are themselves a kind of consensus.
Bayesian research is just more honest about what's already the case.
I wouldn't agree; at any given time there is a lot of disagreement and non-consensus in given fields. So we need new research to gather additional evidence. If every research paper tried to argue for a particular prior and posterior, rather than just gathering evidence, we would never make progress toward consensus either...
> I wouldn't agree; at any given time there is a lot of disagreement and non-consensus in given fields.
That's precisely my point. The act of presenting and refining research IS the act of building that consensus.
My statement here is not a novel thought. It has been pretty much the modern philosophy of science for over a decade.
> So we need new research to gather additional evidence.
This is simply data gathering though. Every approach starts here. I'm not sure why you suggest that people using Bayesian approaches to analysis are somehow forbidden from being informed by data (or informing priors by data).
That's exactly the same process folks use when selecting non-Bayesian models. They don't spring from absolute truth; they're selected as well.
> If every research paper tried to argue for a particular prior and posterior
Given the replication crisis that's in part due to mis-application of existing models along with a lack of rigor in data collection, having research focus more tightly on the methodology for presenting data and conclusions doesn't seem like a bad outcome at all.
There are several libraries. PyMC4 [1], the next version of PyMC3, will introduce TensorFlow as a backend. TensorFlow Probability [2] is from Google. Pyro [3], from Uber, uses a PyTorch backend.
[1] Stan, a platform for statistical modeling http://mc-stan.org/
[2] Automatic Differentiation Variational Inference https://arxiv.org/abs/1603.00788
[3] Auto-Encoding Variational Bayes https://arxiv.org/abs/1312.6114