Bayesian statistics for confused data scientists
165 points by speckx 5 days ago | 54 comments

statskier 21 hours ago
I went through grad school in a very frequentist environment. We “learned” Bayesian methods but we never used them much.

In my professional life I’ve never personally worked on a problem that I felt wasn’t adequately approached with frequentist methods. I’m sure other people’s experiences are different depending on the problems they gravitate towards.

In fact, I tend to get pretty frustrated with Bayesian approaches because when I do turn to them it tends to be in situations that are already quite complex and large. In basically every instance of that I’ve never been able to make the Bayesian approach work. It won’t converge, or the sampler says it will take days and days to run. I can almost always just resort to some resampling method that might take a few hours, but it runs and gives me sensible results.

I realize this is heavily biased by basically only attempting it on super-complex problems, but it has sort of soured me on even trying anymore.

To be clear I have no issue with Bayesian methods. Clearly they work well and many people use them with great success. But I just haven’t encountered anything in several decades of statistical work that I found really required Bayesian approaches, so I’ve really lost any motivation I had to experiment with it more.

reply
nextos 21 hours ago
> I’ve never personally worked on a problem that I felt wasn’t adequately approached with frequentist methods

Multilevel models are one example of a problem where Bayesian methods are hard to avoid, as otherwise inference is unstable, particularly when available observations are not abundant. Multilevel models should be used more often, as shrinking of effect sizes is important to make robust estimates.

Lots of flashy results published in Nature Medicine and similar journals turn out to be statistical noise when you look at them from a rigorous perspective with adequate shrinking. I often review for these journals, and it's a constant struggle to try to inject some rigor.

From a more general perspective, many frequentist methods fall prey to Lindley's Paradox. In simple terms, their inference is poorly calibrated for large sample sizes. They often mistake a negligible deviation from the null for a "statistically significant" discovery, even when the evidence actually supports the null. This is quite typical in clinical trials. Spiegelhalter et al. (2003) is a great read to learn more even if you are not interested in medical statistics [1].
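To make that concrete, here is a rough numerical sketch in Python with made-up numbers (100,000 coin flips, 50,400 heads): the two-sided test "rejects" fairness at the 5% level, while the Bayes factor, with a uniform prior on the bias under the alternative, actually favors the null by about 10 to 1.

    import numpy as np
    from scipy import stats
    from scipy.special import betaln, gammaln

    n, k = 100_000, 50_400                       # made-up flips and heads

    # Two-sided z-test of H0: p = 0.5
    z = (k / n - 0.5) / np.sqrt(0.25 / n)
    p_value = 2 * stats.norm.sf(abs(z))          # ~0.011, "significant" at 5%

    # Bayes factor BF01 = P(data | H0) / P(data | H1), with H1: p ~ Uniform(0, 1)
    log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    log_m0 = log_choose + n * np.log(0.5)            # marginal likelihood under H0
    log_m1 = log_choose + betaln(k + 1, n - k + 1)   # = -log(n + 1) under uniform prior
    print(p_value, np.exp(log_m0 - log_m1))          # p ~ 0.011, BF01 ~ 10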

[1] https://onlinelibrary.wiley.com/doi/book/10.1002/0470092602

reply
michaelbarton 4 hours ago
Curious what you might consider “adequate shrinking”?

Horseshoe priors, partial pooling, something more?

I realize that might be highly subjective.

reply
nextos 3 hours ago
I guess this depends on the problem at hand.

But I was thinking about a typical hierarchical model with partial pooling and standard weakly informative priors.
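For what it's worth, here is a minimal sketch of that kind of model in PyMC, with made-up data: group effects are partially pooled toward a shared mean under weakly informative priors.

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(0)
    J = 8                                         # number of groups
    group = rng.integers(0, J, size=200)          # group index per observation
    y = rng.normal(loc=group * 0.1, scale=1.0)    # fake outcomes

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 1.0)            # population mean
        tau = pm.HalfNormal("tau", 1.0)           # between-group spread
        # Non-centered parameterization, which often samples better
        # when some groups have little data
        z = pm.Normal("z", 0.0, 1.0, shape=J)
        theta = pm.Deterministic("theta", mu + tau * z)   # group effects
        sigma = pm.HalfNormal("sigma", 1.0)               # residual noise
        pm.Normal("y_obs", mu=theta[group], sigma=sigma, observed=y)
        idata = pm.sample()                       # NUTS by default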

reply
statskier 20 hours ago
I agree Bayesian approaches to multilevel modeling situations are clearly quite useful and popular.

Ironically, this has been one of the primary examples where, in my personal experience with the problems I have worked on, frequentist mixed and random effects models have worked just fine. On rare occasions I have encountered a situation where the data was particularly complex or I wanted to use an unusual compound probability distribution and thought Bayesian approaches would save me. Instead, I have routinely ended up with models that never converge or take impractical amounts of time to run. Maybe it’s my lack of experience, jumping into Bayesian methods only on super hard problems. That’s totally possible.

But I have found many frequentist approaches to multilevel modeling perfectly adequate. That does not, of course, mean that will hold true for everyone or all problems.

One of my hot takes is that people seriously underestimate the diversity of data problems such that many people can just have totally different experiences with methods depending on the problems they work on.

reply
nextos 20 hours ago
These days, the advantage is that a generative model can be cleanly decoupled from inference. With probabilistic languages such as Stan, Turing or Pyro it is possible to encode a model and then perform maximum likelihood, variational Bayes, approximate Bayesian inference, as well as other more specialized approaches, depending on the problem at hand.
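As a rough illustration of that decoupling, here is the same toy model handed to three different inference routines in PyMC (made-up data; the same idea carries over to Stan, Turing or Pyro):

    import numpy as np
    import pymc as pm

    y = np.random.default_rng(1).normal(loc=2.0, scale=1.5, size=100)

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 10.0)
        sigma = pm.HalfNormal("sigma", 5.0)
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)

        point = pm.find_MAP()            # optimization: a point estimate
        approx = pm.fit(method="advi")   # variational Bayes
        idata = pm.sample()              # full MCMC with NUTS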

If you have experienced problems with convergence, give Stan a try. Stan is really robust, polished, and simple. Besides, models are statically typed and it warns you when you do something odd.

Personally, I think once you start doing multilevel modeling to shrink estimates, there's no way back. At least in my case, I now see it everywhere. Thanks to efficient variational Bayes methods built on top of JAX, it is doable even on high-dimensional models.

reply
jmalicki 18 hours ago
Thank you for Lindley's paradox! TIL
reply
getnormality 19 hours ago
The evidence "actually supports the null" over what alternative?

In a Bayesian analysis, the result of an inference, e.g. about the fairness of a coin as in Lindley's paradox, depends completely on the distribution of the alternative specified in the analysis. The frequentist analysis, for better and worse, doesn't need to specify a distribution for the alternative.

The classic Lindley's paradox uses a uniform alternative, but there is no justification for this at all. It's not as though a coin is either perfectly fair or has a totally random heads probability. A realistic bias will be subtle and the prior should reflect that. Something like this is often true of real-world applications too.
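A small sketch of how much that choice matters, with made-up data (100,000 flips, 50,400 heads): the Bayes factor for "fair" vs. "biased" swings from favoring the null under a uniform alternative to favoring the alternative under a prior concentrated near 0.5.

    import numpy as np
    from scipy.special import betaln, gammaln

    n, k = 100_000, 50_400
    log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    log_m0 = log_choose + n * np.log(0.5)                  # P(data | p = 0.5)

    def log_m1(a, b):
        """Marginal likelihood under H1: p ~ Beta(a, b)."""
        return log_choose + betaln(k + a, n - k + b) - betaln(a, b)

    print(np.exp(log_m0 - log_m1(1, 1)))      # uniform alternative: BF01 ~ 10
    print(np.exp(log_m0 - log_m1(500, 500)))  # tight alternative:   BF01 < 1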

reply
_alternator_ 16 hours ago
Thank you. The main problem with Bayesian statistics is that if the outcome depends on your priors, then your priors, not the data, determine the outcome.

Bayesian supporters often like to say they are just using more information by encoding it in priors, but if they had data to support their priors, they would be frequentists.

reply
kgwgk 15 hours ago
If they were doing frequentist inference they wouldn’t be using priors at all and there is nothing frequentist in using previous data to construct prior distributions.
reply
uoaei 14 hours ago
Not true. In frequentist statistics, from the perspective of Bayesians, your prior is a point distribution derived empirically. It doesn't have the same confidence/uncertainty intervals, but it does have an unnecessarily overconfident assumption about the nature of the data-generating process.
reply
kgwgk 12 hours ago
Not true. In frequentist statistics, from the perspective of Bayesians and non-Bayesians alike, there are no priors.

—-

Dear ChatGPT, are there priors in frequentist statistics? (Please answer with a single sentence.)

No — unlike Bayesian statistics, frequentist statistics do not use priors, as they treat parameters as fixed and rely solely on the likelihood derived from the observed data.

reply
zozbot234 11 hours ago
There's always priors, they're just "flat", uniform priors (for maximum likelihood methods). But what "flat" means is determined by the parameterization you pick for your model, which is more or less arbitrary. Bayesians would call this an uninformative prior. And you can most likely account for stronger, more informative priors within frequentist statistics by resorting to so-called "robust" methods.
reply
_alternator_ 8 hours ago
First, there is no such thing as an ‘uninformative’ prior; it’s a misnomer. They can change drastically based on your parameterization (cf. change of variables in integration).

Second, I think the nod to robust methods is what’s often called regularization in frequentist statistics. There are cases where regularization and priors lead to the same methodology (cf. L1-regularized fits and Laplace priors) but the interpretation of the results is different. Bayesians claim they get stronger results, but that’s because they make what are ultimately unjustified assumptions. My point is that if those assumptions were fully justified, they would have to be using frequentist methods.

reply
kgwgk 7 hours ago
One standard way to get uninformative priors is to make them invariant under the transformation groups which are relevant given the symmetries in the problem.
reply
kgwgk 7 hours ago
It’s not true that “there are always priors”. There are no priors when you calculate the area of a triangle, because priors are not a thing in geometry. Priors are not a thing in frequentist inference either.

You may do a Bayesian calculation that looks similar to a frequentist calculation but it will be conceptually different. The result is not really comparable: a frequentist confidence interval and a Bayesian credible interval are completely different things even if the numerical values of the limits coincide.

reply
zozbot234 7 hours ago
Frequentist confidence intervals as generally interpreted are not even compatible with the likelihood principle. There's really not much of a proper foundation for that interpretation of the "numerical values".
reply
kgwgk 6 hours ago
What does “as generally interpreted” mean? There is one valid way to interpret confidence intervals. The point is that it’s not based on a posterior probability and there is no prior probability there either.
reply
kgwgk 7 hours ago
If you want to say that when you do a frequentist analysis which doesn’t include any concept of prior you get a result that has a similar form to the result of a conceptually completely different Bayesian analysis which uses a flat prior (definitely not “a point distribution derived empirically”), that may be correct. It remains true that there is no prior in the frequentist analysis, because priors are not part of frequentist inference at all.
reply
fny 11 hours ago
In clinical settings and situations where probabilities really matter, it's a better fit.

I studied stats at Duke which is a Bayesian academy. Almost every problem comes from regimes with small sample sizes. Given that Duke houses the largest academic clinical research organization globally, having a stats and biostats department with this bent is useful: samples are tiny in clinical trials compared to most big data settings.

The biggest problem with the whole Bayesian regime IMO is that as the data gets larger its selling point vanishes. If your data is big or is normal (mean-based statistics), a frequentist/bootstrapped CI approximates the Bayesian CI anyway.
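A quick simulated sanity check of that convergence (toy data, mean of a normal sample): a bootstrap percentile interval and a flat-prior credible interval land in essentially the same place once n is large.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=10.0, scale=3.0, size=50_000)
    n, xbar, s = len(x), x.mean(), x.std(ddof=1)

    # Bootstrap percentile CI for the mean
    boot = np.array([rng.choice(x, size=n, replace=True).mean()
                     for _ in range(1_000)])
    print(np.percentile(boot, [2.5, 97.5]))

    # Flat-prior posterior for the mean is approximately Normal(xbar, s / sqrt(n))
    print(xbar + np.array([-1.96, 1.96]) * s / np.sqrt(n))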

Furthermore, many of us work in settings where we're trying to sell toothpaste: we don't need the Bayesian guarantees that an insurer might.

reply
JHonaker 8 hours ago
I’m not sure what your professional experience is in, but as a counterpoint, I’ve never been in a situation where I didn’t wish the system I was working with were already in a Bayesian framework. Having said that, I am only occasionally building things from scratch instead of modifying existing systems, so I’m not always lucky enough to be able to work with them.

The pain points around getting a sampler/model pairing working in a reasonable timeframe are definitely a valid complaint. In my experience, inference methods in Bayesian stats are much less forgiving of poorly specified models (or, said another way, they don’t let you get away with ignoring important structural components of the phenomena of interest). A poorly performing model (in terms of sampler speed/mixing) is often a sign of a problem with the geometry of the parameter space. Frustratingly, this can sometimes be a result of conceptually equivalent but computationally different parameterizations (e.g. centered vs non-centered multilevel effects).

The struggles are worth it IMO because it is helpful feedback that helps guide design, and the ease with which I can compute meaningful uncertainty bounds on pretty much any quantity of interest is invaluable.

reply
storus 21 hours ago
A large portion of generative AI is based on Bayesian statistics, like stable diffusion, regularization, LLM as a learned prior (though trained with frequentist MLE), variational autoencoders etc. Chain-of-thought and self-consistency can be viewed as Bayesian as well.
reply
jmalicki 17 hours ago
I feel like I'm a polyglot here but primarily a native frequentist thinker.

I've found Bayesian methods shine in cases of an "intractable partition function".

Cases such as language models, where the cardinality of your discrete probability distribution is extremely large, to the point of intractability.

Bayesians tend to immediately go to things like Monte Carlo estimation. Is that fundamentally Bayesian and anti-frequentist? Not really... it's just that being open to Bayesian ways of thinking leads you towards that more.

Reinforcement learning also feels much more naturally Bayesian. I mean Thompson sampling, the granddaddy of RL, was developed through a frequentist lens. But it also feels very Bayesian as well.

In the modern era, we have Stein's paradox, and it all feels the same.

Hardcore Bayesians that seem to deeply hate the Kolmogorov measure theoretic approach to probability are always interesting to me as some of the last true radicals.

I feel like for 99% of the world today, these are all just tools and we use them where they're useful.

reply
jb1991 15 hours ago
When you are using something like Monte Carlo you’re probably using some method that’s more advanced than Naïve Bayes, is that right?
reply
jmalicki 4 hours ago
I'm talking about, for something simple, the negative sampling in word2vec.

Or the temperature setting for an LLM etc.

reply
fumeux_fume 17 hours ago
Given your bias, why bother making this point on a thread about using Bayesian methods where they are applicable? Just seems like unconstructive negativity.
reply
jhbadger 22 hours ago
I think Rafael Irizarry put it best over a decade ago -- while historically there was a feud between self-declared "frequentists" and "Bayesians", people doing statistics in the modern era aren't interested in playing sides, but use a combination of techniques originating in both camps: https://simplystatistics.org/posts/2014-10-13-as-an-applied-...
reply
jmalicki 17 hours ago
I agree... I feel like "The Elements of Statistical Learning" was possibly one of the first "postmodern" things where "well, frequentist and Bayesian are just tools in the toolbox, we now know they're not so incompatible."

After Stein's paradox it became super hard to be a pure frequentist if you didn't have your head in the sand.

reply
therobots927 19 hours ago
That’s Bayesian propaganda
reply
jmalicki 18 hours ago
Huh? Are there really any pure frequentists post Stein's paradox? At least ones that are aware of it and maintain objections to fusing the fields?
reply
kgwgk 12 hours ago
> Are there really any pure frequentists post Stein's paradox?

What does that have to do with anything? If one cares about that, using a shrinkage estimator is an option that maintains frequentist purity.

reply
jmalicki 4 hours ago
There is some frequentist procedure there, but it seems hard to not recognize the deep connection to Bayesian statistics and wonder if you should begin to question your baseline assumptions. Since the entire justification for using a shrinkage estimator has a whole lot more in common with the foundations of Bayesian statistics than it does with the foundations of frequentist stats.

Purist frequentists using a shrinkage estimator look a lot like heliocentric Ptolemaic astronomy.

reply
therobots927 5 hours ago
Downvote me all you want. Bayesianism is misapplied much more frequently than frequentism. It just makes it way too easy to fudge p values. Sorry not sorry.
reply
jmalicki 4 hours ago
I do always laugh when I see a Bayesian object to p-values, then use a Bayesian procedure that is mathematically identical to treating p values as posterior probabilities.

Just saying the word "Bayesian" doesn't actually make it different.

reply
oliver236 9 hours ago
Nice writeup. Something that clicked for me reading this is how much the prior/likelihood/posterior dynamic mirrors transfer learning in deep learning. The prior is basically your pre-trained weights: broad knowledge you bring to the table before seeing any task-specific data. The likelihood is your fine-tuning step. And the Bernstein-von Mises result at the end is essentially saying "with enough fine-tuning data, your pre-training washes out."

Obviously the analogy isn't perfect (priors are explicit and interpretable, pre-trained weights are not), but I think it's a useful mental model for anyone coming from an ML background who finds Bayesian stats unintuitive. Regularization being secretly Bayesian was the other thing that made it click for me. If you've ever tuned a Ridge regression lambda, you were doing informal prior selection.
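A tiny illustration of the "washes out" point, with made-up coin-flip counts: two very different Beta priors give visibly different posteriors at n = 20 and essentially identical ones at n = 20,000.

    from scipy import stats

    priors = {"skeptical": (50, 50), "enthusiastic": (8, 2)}
    for n, k in [(20, 14), (20_000, 14_000)]:          # heads out of n flips
        for name, (a, b) in priors.items():
            lo, hi = stats.beta(a + k, b + n - k).ppf([0.025, 0.975])
            print(f"n={n:>6} {name:>12}: 95% interval ({lo:.3f}, {hi:.3f})")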

reply
algolint 11 hours ago
The frequentist vs. Bayesian debate often becomes more about "what can I compute easily?" than "what is the correct mental model?". With tools like Stan and PyMC getting better, the "computational cost" argument is weakening, but the "intuition cost" remains high. Most people are naturally frequentists in their day-to-day reasoning, and switching to a mindset of "probability as a degree of belief" requires a significant cognitive shift that isn't always rewarded with better results in simple business or engineering contexts.
reply
NeutralCrane 6 hours ago
The exact opposite is true. Virtually everyone’s intuition is aligned with the Bayesian model. That intuition has to be hammered out of people in their stats classes because for decades frequentist approaches were computationally more feasible, even if they don’t align with how most humans interpret probability.
reply
algolint 4 hours ago
That's a fair challenge. My operational perspective is heavily anchored by the systems we run. From an engineering leadership standpoint, our entire observability stack (SLAs, p99 latencies, error budgets) is fundamentally frequentist. The cognitive shift I'm highlighting isn't purely mathematical; it's getting an on-call engineer to reason in distributions of confidence rather than binary threshold alerts. When a distributed system is degrading, the 'hammered-in' frequentist threshold is often the fastest path to mitigation.
reply
zaik 10 hours ago
I would argue the opposite is true. It takes a long time to beat the Bayesian thinking out of students when presenting them with a confidence interval: https://link.springer.com/article/10.3758/s13423-013-0572-3
reply
algolint 10 hours ago
That's a fair challenge, and the Morey et al. paper is a staple for a reason: it highlights that frequentist intervals are often 'answers to questions nobody asked.'

However, from an engineering lead's perspective, I find that while students might have a 'Bayesian intuition,' our industry-standard observability tools (Prometheus, etc.) are fundamentally frequentist. We define SLAs based on tail latency percentiles (p99), which are frequentist estimators.

The cognitive shift I'm referring to is moving from 'here is a threshold' to 'here is a distribution of possible truths' when building adaptive systems, like agentic orchestrators. In those cases, the overhead of a Bayesian approach (defining priors for every microservice latency, etc.) often loses out to the pragmatism of 'is the p99 stable?'. We trade theoretical correctness for operational speed and simplicity.

reply
bvan 9 hours ago
Nicely done. I have the same challenge with Bayesian stats and usually do not understand why there is such controversy. It isn’t a question of either/or, except in the minds of academics who rarely venture out into the real world, or have to balance intellectual purity with getting a job done.

In the very first example, a practitioner would consciously have to decide (i.e. make the assumption) whether the number of sides on the die (n) is known and deterministic. Once that decision is made, the framework with which observations are evaluated and statistical reasoning applied will forever be conditional on that assumption... unless it is revised. Practitioners are generally OK with that, whether it leads to ‘Bayesian’ or ‘frequentist’ analysis, and move on.

reply
fumeux_fume 16 hours ago
As a data scientist, I find applied Bayesian methods to be incredibly straightforward for most of the common problems we see like A/B testing and online measuring of parameters. I dislike that people usually first introduce Bayesian methods theoretically, which can be a lot for beginners to wrap their head around. Why not just start from the blissful elegance of updating your parameter's prior distribution with your observed data to magically get your parameter's estimate?
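For example, a barebones sketch of that workflow for an A/B test on conversion rates, with made-up counts: start from a weak Beta prior, update with the observed successes and failures, and read estimates straight off the posterior.

    import numpy as np
    from scipy import stats

    a0, b0 = 1, 1                                   # weak Beta(1, 1) prior
    data = {"A": (120, 1000), "B": (150, 1000)}     # (conversions, visitors)

    posteriors = {arm: stats.beta(a0 + c, b0 + n - c)
                  for arm, (c, n) in data.items()}
    for arm, post in posteriors.items():
        lo, hi = post.ppf([0.025, 0.975])
        print(f"{arm}: mean {post.mean():.3f}, 95% interval ({lo:.3f}, {hi:.3f})")

    # Probability that B's rate beats A's, via simple Monte Carlo
    draws = {arm: post.rvs(100_000, random_state=0)
             for arm, post in posteriors.items()}
    print("P(B > A) ~", np.mean(draws["B"] > draws["A"]))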
reply
oliver236 9 hours ago
can you explain what you're saying please?
reply
smokel 8 hours ago
This article made me enthusiastic to dive into Bayesian statistics (again). A quick search led me to Think Bayes [1], which also introduces the concepts using Python, and seems to have a little more depth.

[1] https://allendowney.github.io/ThinkBayes2/

reply
jrumbut 14 hours ago
The author makes a comparison to Haskell, which I think might be a little misleading.

Haskell is a little more complicated to learn but also more expressive than other programming languages, this is where the comparison works.

But where it breaks down is safety. If your Haskell code runs, it's more likely to be correct because of all the type system goodness.

That's the reverse of the situation with Bayesian statistics, which is more like C++. It has all kinds of cool features, but they all come with superpowered footguns.

Frequentist statistics is more like Java. No one loves it but it allows you to get a lot of work done without having to track down one of the few people who really understand Haskell.

reply
hawtads 18 hours ago
I think it would be interesting if frequentist stats could come up with more generative models. Current high-level generative machine learning all relies on Bayesian modeling.
reply
jmalicki 18 hours ago
I'm not well versed enough, but what would a frequentist generative model even mean?

The entire generative concept implicitly assumes that parameters have probability distributions themselves that naturally give rise to generative models...

You could do frequentist inference on a generative model, sure, but generative modelling seems fundamentally alien to frequentist thinking?

reply
hawtads 18 hours ago
I am more familiar with Bayesian than frequentist stats, but given that they are mathematically equivalent, shouldn't frequentist stats have an answer to e.g. the loss function of a VAE? Or is generative machine learning inherently impossible to model with frequentist stats?

Though if you think about it, a diffusion model is somewhat (partially) frequentist.

reply
jmalicki 17 hours ago
I guess you have me thinking more... things like Parzen window estimators or other KDEs are frequentist...

But while it's a probability distribution, to a frequentist they are estimating the fixed parameters of a distribution.

The distribution isn't generative, it just represents uncertainty - and I think that's a bit of the deep core philosophical divide between frequentists and Bayesians - you might use all the same math, but you cannot possibly think of it as being generative.

reply
jmalicki 18 hours ago
They do!

https://arxiv.org/pdf/2510.18777

But that doesn't mean a frequentist views a VAE as a generative model!

Putting it another way, Gaussian processes originated as a frequentist technique! But to a frequentist they are not generative.

reply
hawtads 18 hours ago
Ooh good find, thanks for the link. This will be my bedtime reading for this week :)
reply
DeathArrow 13 hours ago
Most ML algorithms, be it SVMs, random forests or neural networks, require parameter tuning. That in itself is using Bayesian statistics.
reply
7777777phil 8 hours ago
Most ML practitioners use L1/L2 daily without realizing they're making Bayesian prior assumptions. Gaussian prior = Ridge, Laplace prior = Lasso. Once you see it that way, "choosing a regularization strength" is really "choosing how informative your prior is."
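A quick numerical check of that correspondence on simulated data: the closed-form ridge solution with penalty lam matches the MAP estimate under a zero-mean Gaussian prior on the weights with variance sigma^2 / lam.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=200)

    sigma, lam = 1.0, 10.0
    # Ridge closed form: argmin ||y - Xw||^2 + lam * ||w||^2
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

    # MAP: minimize the negative log posterior with prior w ~ N(0, sigma^2 / lam)
    def neg_log_post(w):
        return (np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)) / (2 * sigma ** 2)

    w_map = minimize(neg_log_post, np.zeros(5)).x
    print(np.allclose(w_ridge, w_map, atol=1e-4))   # True: same estimator, two stories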
reply
lottin 19 hours ago
> In Bayesian statistics, on the other hand, the parameter is not a point but a distribution.

To be more precise, in Bayesian statistics a parameter is a random variable. But what does that mean? A parameter is a characteristic of a population (as opposed to a characteristic of a sample, which is called a statistic). A quantity, such as the average number of cars per household right now. That's a parameter. To think of a parameter as a random variable is like regarding reality as just one realisation of an infinite number of alternate realities that could have been. The problem is we only observe our reality. All the data samples that we can ever study come from this reality. As a result, it's impossible to infer anything about the probability distribution of the parameter. The whole Bayesian approach to statistical inference is nonsensical.

reply