Prior probability is the probability of an event before we take new data into account. In Bayesian inference, the prior distribution expresses what we assume about the parameter based on what we already know.
The conjugate prior distribution makes little sense without the basics of Bayesian inference. In what follows, we assume you are familiar with the prior, the likelihood (sampling distribution), and the posterior.
For some likelihood functions, choosing a particular prior makes the posterior belong to the same family of distributions as that prior. Such a prior is called a conjugate prior.
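For reference, the update rule behind all of this is Bayes' theorem; the normalizing denominator P(X) is what we will later be able to avoid computing:

$$
P(\theta \mid X) \;=\; \frac{P(X \mid \theta)\,P(\theta)}{P(X)} \;\propto\; P(X \mid \theta)\,P(\theta)
$$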
It is easier to explain with an example. Below is code for computing the posterior distribution in a binomial model: θ is the probability of success, and our task is to pick the θ that maximizes the posterior probability.
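Here is a minimal sketch of such a computation, assuming a grid of candidate θ values, a Beta(2, 2) prior, and 6 successes out of 10 trials (all of these numbers are illustrative assumptions, not values from the original listing):

```python
import numpy as np
from scipy.stats import binom, beta

successes, trials = 6, 10        # assumed observed data
a, b = 2.0, 2.0                  # assumed Beta prior parameters

# Grid of candidate values for theta, the probability of success.
thetas = np.linspace(0.001, 0.999, 999)

# Likelihood P(X | theta) for every candidate theta.
likelihood = binom.pmf(successes, trials, thetas)

# Prior P(theta) evaluated on the same grid.
prior = beta.pdf(thetas, a, b)

# Unnormalized posterior: likelihood * prior, computed for every theta.
posterior = likelihood * prior

# Normalize so the posterior sums to one over the grid.
posterior /= posterior.sum()

# Maximum a posteriori (MAP) estimate: the theta with the highest posterior.
theta_map = thetas[np.argmax(posterior)]
print(theta_map)   # ~0.58 for these assumed numbers
```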
Two things make the posterior difficult to compute. First, we have to compute the posterior probability for every candidate θ.
Then we normalize the posterior (the normalization step in the code above). Even if you choose not to normalize, the real challenge is to find the posterior maximum, i.e., the maximum a posteriori (MAP) estimate. To find it, we have to evaluate every candidate – the likelihood P(X | θ) for each θ.
Second, if there is no closed-form expression for the posterior distribution, we have to find the maximum by numerical optimization (such as gradient descent or Newton's method).
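When no closed form is available, the same MAP estimate can be found numerically. A sketch under the same assumed Beta-Binomial setup, using a bounded scalar minimizer over the negative log posterior (any continuous optimizer would do):

```python
from scipy.optimize import minimize_scalar
from scipy.stats import binom, beta

successes, trials, a, b = 6, 10, 2.0, 2.0   # same assumed setup as above

def neg_log_posterior(theta):
    # Negative log of (likelihood * prior); minimizing it maximizes the posterior.
    return -(binom.logpmf(successes, trials, theta) + beta.logpdf(theta, a, b))

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)   # numerical MAP estimate of theta, ~0.58
```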
Knowing that the prior is conjugate, you can skip the posterior ∝ likelihood × prior computation. Moreover, because the posterior then has a closed form, you already know where its maximum lies.
In our example, the beta distribution is the conjugate prior of the binomial likelihood. What does that mean? Already at the modeling stage we know that the posterior will also be a beta distribution. So, after running more experiments, you can obtain the posterior by simply adding the number of successes and failures to the existing parameters α and β, respectively, instead of multiplying the likelihood by the prior. Very convenient!
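Under the same illustrative assumptions, the conjugate update reduces to two additions, and the posterior mode is available in closed form:

```python
a, b = 2.0, 2.0                 # assumed Beta(a, b) prior
successes, failures = 6, 4      # assumed new observations

# Conjugacy: the posterior is simply Beta(a + successes, b + failures).
post_a = a + successes
post_b = b + failures

# Closed-form mode (MAP) of a Beta(post_a, post_b) distribution (valid for parameters > 1).
theta_map = (post_a - 1) / (post_a + post_b - 2)
print(theta_map)   # ~0.58, matching the grid and optimization results above
```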
In data science and machine learning, a model is never truly finished – you constantly need to update it as new data arrives (which is exactly why Bayesian inference is used).
The calculations in Bayesian inference can be laborious, and sometimes analytically intractable. But with a conjugate prior and its closed-form posterior, they become almost trivial.
There are a few notes:
When we use a conjugate prior, sequential estimation (updating the posterior after each observation) gives the same result as a single batch update on all the data; a quick check is sketched after these notes.
To find the posterior maximum, you do not need to normalize the product of the likelihood (sampling distribution) and the prior, i.e., you can skip integrating over all possible θ in the denominator; normalization does not change where the maximum lies. However, if you need to compare posterior probabilities across models or compute point estimates such as the posterior mean, normalization is necessary.
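A quick check of the sequential-versus-batch claim, again under the assumed Beta-Binomial setup:

```python
a, b = 2.0, 2.0                                 # assumed Beta prior
observations = [1, 0, 1, 1, 0, 1, 1, 0, 1, 0]   # assumed data: 1 = success, 0 = failure

# Sequential estimation: update the Beta parameters after each observation.
seq_a, seq_b = a, b
for x in observations:
    seq_a += x
    seq_b += 1 - x

# Batch estimation: a single update using the totals.
batch_a = a + sum(observations)
batch_b = b + len(observations) - sum(observations)

print((seq_a, seq_b) == (batch_a, batch_b))   # True: both give Beta(8.0, 6.0)
```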