[Predictive Analytics, Spring 2021] Chapter 4: Bayesian statistics

2021-06-08 18:05



Chapter 4 Bayesian statistics: an introduction

\[p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta}) p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}=\frac{p(\boldsymbol{\theta}) p(\mathcal{D} \mid \boldsymbol{\theta})}{\int p\left(\boldsymbol{\theta}^{\prime}\right) p\left(\mathcal{D} \mid \boldsymbol{\theta}^{\prime}\right) d \boldsymbol{\theta}^{\prime}} \]

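\(p(\mathcal{D})\): the marginal likelihood (normalizing constant). It does not depend on \(\boldsymbol{\theta}\): it is the marginal distribution of the data, obtained by integrating the parameters out, so it is the same for every parameter value.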
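To make the role of the normalizing constant concrete, here is a minimal sketch that applies Bayes' rule on a grid, replacing the integral in the denominator with a sum. The Bernoulli likelihood, the uniform prior, and the data (7 successes, 3 failures) are assumptions chosen for illustration only; they are not part of the notes.

```python
# Grid approximation of Bayes' rule for a single Bernoulli parameter theta.
# Assumptions (illustrative only): uniform prior, 7 successes and 3 failures.

def grid_posterior(successes, failures, grid_size=101):
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size  # p(theta), uniform over the grid
    likelihood = [t ** successes * (1 - t) ** failures for t in thetas]  # p(D | theta)
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # p(D): one number, shared by every theta
    posterior = [u / evidence for u in unnormalized]
    return thetas, posterior, evidence

thetas, posterior, evidence = grid_posterior(successes=7, failures=3)
best_index = max(range(len(thetas)), key=lambda i: posterior[i])
posterior_mode = thetas[best_index]
print(f"p(D) on the grid: {evidence:.6f}, posterior mode: {posterior_mode:.2f}")
```

Dividing by `evidence` only rescales the curve so that it sums to one; it does not change where the posterior peaks.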

This chapter covers the following parts:

  • summarizing the posterior
  • posterior computation with a conjugate prior
  • Bayesian model comparison
  • approximating the posterior

summarizing the posterior

point estimates

e.g., the posterior mean or the posterior median

credible interval

A credible interval measures the confidence in our parameter estimates. (A small sample size or low-quality data may lead to larger uncertainty.)

We use a \(100(1-\alpha)\%\) credible interval \((l, u)\), whose lower and upper endpoints enclose \(1-\alpha\) of the posterior probability mass:

\[P(l \le \theta \le u \mid \mathcal{D})=1-\alpha \]
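The interval defined above can be estimated from posterior samples by cutting off \(\alpha/2\) of the mass in each tail (a central interval; this is one common choice, not the only one). The sketch below assumes, purely for illustration, that the posterior is Beta(8, 4); the function name `central_credible_interval` is hypothetical.

```python
import random

def central_credible_interval(samples, alpha=0.05):
    """Return (l, u) leaving roughly alpha/2 of the samples in each tail."""
    ordered = sorted(samples)
    n = len(ordered)
    lower = ordered[int((alpha / 2) * n)]
    upper = ordered[min(n - 1, int((1 - alpha / 2) * n))]
    return lower, upper

rng = random.Random(0)  # fixed seed so the run is reproducible
# Assumption: the posterior over theta is Beta(8, 4) (illustrative only).
samples = [rng.betavariate(8, 4) for _ in range(20000)]

l, u = central_credible_interval(samples, alpha=0.05)
coverage = sum(l <= s <= u for s in samples) / len(samples)
print(f"95% credible interval: ({l:.3f}, {u:.3f}), empirical coverage: {coverage:.3f}")
```

With fewer observations the posterior is wider and the interval grows, which is how the larger uncertainty mentioned above shows up numerically.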

posterior computation with conjugate prior

conjugate prior

In this section, we consider a set of (prior, likelihood) pairs for which we can compute the posterior in closed form.


In particular, we will use priors that are "conjugate" to the likelihood.

We say that a prior \(p(\boldsymbol{\theta}) \in \mathcal{F}\) is a conjugate prior for a likelihood function \(p(\mathcal{D} \mid \boldsymbol{\theta})\) if the posterior is in the same parameterized family as the prior, i.e., \(p(\boldsymbol{\theta} \mid \mathcal{D}) \in \mathcal{F}\).


In other words, \(\mathcal{F}\) is closed under Bayesian updating. If the family \(\mathcal{F}\) corresponds to the exponential family, then the computations can be performed in closed form.


The Dirichlet-multinomial model

  • Likelihood:

Let \(Y \sim \operatorname{Cat}(\boldsymbol{\theta})\) be a discrete random variable drawn from a categorical distribution. The likelihood has the form

\[p(\mathcal{D} \mid \boldsymbol{\theta})=\prod_{n=1}^{N} \operatorname{Cat}\left(y_{n} \mid \boldsymbol{\theta}\right)=\prod_{n=1}^{N} \prod_{c=1}^{C} \theta_{c}^{\mathrm{I}\left(y_{n}=c\right)}=\prod_{c=1}^{C} \theta_{c}^{N_{c}} \]

where \(N_{c}=\sum_{n} \mathbb{I}\left(y_{n}=c\right)\) is the number of times class \(c\) appears in the data (for a die, the number of times face \(c\) comes up).


  • Prior:

The conjugate prior for a categorical distribution is the Dirichlet distribution, which is a multivariate generalization of the beta distribution. This has support over the probability simplex, defined by

\[S_{K}=\left\{\boldsymbol{\theta}: 0 \leq \theta_{k} \leq 1, \sum_{k=1}^{K} \theta_{k}=1\right\} \]

The pdf of the Dirichlet is defined as follows:

\[\operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \triangleq \frac{1}{B(\breve{\boldsymbol{\alpha}})} \prod_{k=1}^{K} \theta_{k}^{\breve{\alpha}_{k}-1} \mathbb{I}\left(\boldsymbol{\theta} \in S_{K}\right) \]

where \(B(\breve{\boldsymbol{\alpha}})\) is the multivariate beta function,

\[B(\breve{\boldsymbol{\alpha}}) \triangleq \frac{\prod_{k=1}^{K} \Gamma\left(\breve{\alpha}_{k}\right)}{\Gamma\left(\sum_{k=1}^{K} \breve{\alpha}_{k}\right)} \]



  • Posterior:

We can combine the multinomial likelihood and Dirichlet prior to compute the posterior, as follows:

\[\begin{aligned} p(\boldsymbol{\theta} \mid \mathcal{D}) & \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \\ & \propto \left[\prod_{k} \theta_{k}^{N_{k}}\right]\left[\prod_{k} \theta_{k}^{\breve{\alpha}_{k}-1}\right] \\ & \propto \operatorname{Dir}\left(\boldsymbol{\theta} \mid \breve{\alpha}_{1}+N_{1}, \ldots, \breve{\alpha}_{K}+N_{K}\right) \\ &=\operatorname{Dir}(\boldsymbol{\theta} \mid \widehat{\boldsymbol{\alpha}}) \end{aligned} \]

where \(\widehat{\alpha}_{k}=\breve{\alpha}_{k}+N_{k}\) are the parameters of the posterior.
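The update \(\widehat{\alpha}_{k}=\breve{\alpha}_{k}+N_{k}\) amounts to adding the observed counts to the prior parameters. The sketch below shows this for three classes; the prior values and the observations are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter

def dirichlet_posterior(prior_alpha, observations):
    """Add the class counts N_k to the prior parameters alpha_k."""
    counts = Counter(observations)
    return [a + counts[k] for k, a in enumerate(prior_alpha)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]

prior_alpha = [1.0, 1.0, 1.0]      # assumed uniform prior over 3 classes
observations = [0, 2, 2, 1, 2, 0]  # assumed class labels y_n in {0, 1, 2}

posterior_alpha = dirichlet_posterior(prior_alpha, observations)
posterior_mean = dirichlet_mean(posterior_alpha)
print(posterior_alpha)  # [3.0, 2.0, 4.0]
```

Here the counts are \(N = (2, 1, 3)\), so the posterior is \(\operatorname{Dir}(3, 2, 4)\) and the posterior mean is \((3/9, 2/9, 4/9)\).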

The Gaussian-Gaussian model



If \(\mu\) is a known constant, the likelihood for \(\sigma^{2}\) has the form

\[p\left(\mathcal{D} \mid \sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-N / 2} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right) \]


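where we can no longer ignore the \(\left(\sigma^{2}\right)^{-N/2}\) factor in front, because it depends on the unknown \(\sigma^{2}\). The standard conjugate prior is the inverse Gamma distribution (the distribution of \(1/X\) when \(X\) follows a Gamma distribution), given by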

\[\mathrm{IG}\left(\sigma^{2} \mid \breve{a}, \breve{b}\right)=\frac{\breve{b}^{\breve{a}}}{\Gamma(\breve{a})}\left(\sigma^{2}\right)^{-(\breve{a}+1)} \exp \left(-\frac{\breve{b}}{\sigma^{2}}\right) \]


\[\begin{aligned} p(\sigma^{2} \mid \mathcal{D}) & \propto p(\mathcal{D} \mid \sigma^{2})\, \mathrm{IG}\left(\sigma^{2} \mid \breve{a}, \breve{b}\right) \\ & \propto \left(\sigma^{2}\right)^{-\left(\frac{N}{2}+\breve{a}+1\right)} \exp \left(-\frac{\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}}{\sigma^{2}}\right) \end{aligned} \]

This is the kernel of \(\mathrm{IG}\left(\sigma^{2} \mid \frac{N}{2}+\breve{a},\ \breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right)\).

Multiplying the likelihood and the prior, we see that the posterior is also IG:

\[\begin{aligned} p\left(\sigma^{2} \mid \mu, \mathcal{D}\right) &=\mathrm{IG}\left(\sigma^{2} \mid \widehat{a}, \widehat{b}\right) \\ \widehat{a} &=\breve{a}+N / 2 \\ \widehat{b} &=\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2} \end{aligned} \]
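As a quick numerical check of the inverse Gamma update, the sketch below computes the posterior shape and scale for a small data set with known mean. The prior values, the data, and the function name are assumptions for illustration.

```python
def inverse_gamma_posterior(a_prior, b_prior, data, mu):
    """Posterior IG(a, b) for the variance of a Gaussian with known mean mu."""
    n = len(data)
    sum_sq = sum((y - mu) ** 2 for y in data)
    a_post = a_prior + n / 2
    b_post = b_prior + 0.5 * sum_sq
    return a_post, b_post

# Assumed prior IG(2, 1), known mean mu = 2, and four observations.
a_post, b_post = inverse_gamma_posterior(
    a_prior=2.0, b_prior=1.0, data=[1.0, 3.0, 2.0, 2.0], mu=2.0
)
# The mean of IG(a, b) is b / (a - 1), defined only when a > 1.
posterior_mean_variance = b_post / (a_post - 1)
print(a_post, b_post, posterior_mean_variance)
```

With \(N=4\) and \(\sum_{n}(y_{n}-\mu)^{2}=2\), this gives \(\widehat{a}=4\) and \(\widehat{b}=2\), so the posterior mean of \(\sigma^{2}\) is \(2/3\).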

In general, the posterior has no closed form, so we have to use approximate inference methods.

Bayesian Model comparison

"All models are wrong, but some are useful." (George Box)

We assume we have a set of candidate models \(\mathcal{M}\).

Objective: choose the best model from the set \(\mathcal{M}\), i.e., the one with the highest posterior probability.

\[\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}} p(m \mid \mathcal{D}) \]


\[p(m \mid \mathcal{D})=\frac{p(\mathcal{D} \mid m) p(m)}{\sum_{m^{\prime} \in \mathcal{M}} p\left(\mathcal{D} \mid m^{\prime}\right) p\left(m^{\prime}\right)} \]



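\(p(\mathcal{D} \mid m)\): the marginal likelihood of the data once model \(m\) has been fixed. It plays the role of \(p(\mathcal{D})\) in Bayes' rule for the parameters, \(p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta})\, p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}\).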

If the prior over models is uniform, \(p(m)=1 /|\mathcal{M}|,\) then the MAP model is given by


\[\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}} p(\mathcal{D} \mid m) \]

The quantity \(p(\mathcal{D} \mid m)\) is given by

\[p(\mathcal{D} \mid m)=\int p(\mathcal{D} \mid \boldsymbol{\theta}, m) p(\boldsymbol{\theta} \mid m) d \boldsymbol{\theta} \]
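For conjugate models the integral above has a closed form, which makes a small model comparison easy to sketch. The example below compares two hypothetical models of a binary sequence: a fair coin with no free parameter, and a Bernoulli model with a Beta(1, 1) prior, whose marginal likelihood is \(B(a+N_{1}, b+N_{0}) / B(a, b)\). The models, the data (9 ones and 1 zero), and all names are assumptions for illustration.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_evidence_fair_coin(ones, zeros):
    # No free parameter: p(D | m) = 0.5 ** N for one specific sequence.
    return (ones + zeros) * math.log(0.5)

def log_evidence_beta_bernoulli(ones, zeros, a=1.0, b=1.0):
    # Integrating theta out under a Beta(a, b) prior gives a ratio of beta functions.
    return log_beta(a + ones, b + zeros) - log_beta(a, b)

def model_posterior(log_evidences):
    """p(m | D) under a uniform prior over models, normalized in log space."""
    top = max(log_evidences.values())
    weights = {m: math.exp(v - top) for m, v in log_evidences.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

ones, zeros = 9, 1  # assumed data
log_evidences = {
    "fair_coin": log_evidence_fair_coin(ones, zeros),
    "beta_bernoulli": log_evidence_beta_bernoulli(ones, zeros),
}
posterior = model_posterior(log_evidences)
best_model = max(posterior, key=posterior.get)
print(posterior, best_model)
```

For this data the evidences are \(1/1024\) and \(1/110\), so the Beta-Bernoulli model is preferred. Working with log evidences avoids underflow when \(N\) is large.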

Bayesian model averaging

If our goal is to perform prediction, we can get better results if we marginalize out over all models, by computing

\[p(y \mid \mathbf{x}, \mathcal{D})=\sum_{m \in \mathcal{M}} p(y \mid \mathbf{x}, m) p(m \mid \mathcal{D}) \]

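Here \(\mathcal{D}\) is the training data, \((\mathbf{x}, y)\) is a new data point, and the goal is to predict \(y\).

Disadvantage: this is computationally very expensive, since every model in \(\mathcal{M}\) must be fitted and evaluated.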
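Once the model posteriors are available, the averaging step itself is only a weighted sum. The sketch below assumes two hypothetical models with made-up posterior probabilities \(p(m \mid \mathcal{D})\) and made-up per-model predictions \(p(y=1 \mid \mathbf{x}, m)\); in practice, computing those inputs is where the cost lies.

```python
def bayesian_model_average(model_posteriors, model_predictions):
    """Weighted sum of per-model predictions, with weights p(m | D)."""
    total = sum(model_posteriors.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("model posteriors must sum to 1")
    return sum(model_posteriors[m] * model_predictions[m] for m in model_posteriors)

# Illustrative numbers only, not derived from any data in the notes.
model_posteriors = {"model_a": 0.25, "model_b": 0.75}  # p(m | D)
model_predictions = {"model_a": 0.5, "model_b": 0.8}   # p(y = 1 | x, m)

p_y1 = bayesian_model_average(model_posteriors, model_predictions)
print(p_y1)
```

The averaged prediction is \(0.25 \cdot 0.5+0.75 \cdot 0.8=0.725\), which lies between the two individual predictions and leans toward the model with the higher posterior probability.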




