[Predictive Analytics 2021 Spring] Chapter 4 Bayesian statistics

2021-06-08 18:05


Chapter 4 Bayesian statistics: a first introduction

\[p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta}) p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}=\frac{p(\boldsymbol{\theta}) p(\mathcal{D} \mid \boldsymbol{\theta})}{\int p\left(\boldsymbol{\theta}^{\prime}\right) p\left(\mathcal{D} \mid \boldsymbol{\theta}^{\prime}\right) d \boldsymbol{\theta}^{\prime}} \]

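\(p(\mathcal{D})\) is the marginal likelihood (normalizing constant). It is the marginal distribution of the data and does not depend on any particular value of \(\boldsymbol{\theta}\), because \(\boldsymbol{\theta}\) has been integrated out.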
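To make Bayes' rule concrete, here is a minimal sketch that replaces the integral in the denominator with a sum over a grid of parameter values. The coin-toss data (7 heads in 10 tosses), the uniform prior, and the function name `grid_posterior` are illustrative assumptions, not part of the original notes.

```python
def grid_posterior(thetas, prior, likelihood):
    """Apply Bayes' rule on a discrete grid of parameter values."""
    # Numerator of Bayes' rule: p(theta) * p(D | theta) at each grid point.
    unnormalized = [p * likelihood(t) for t, p in zip(thetas, prior)]
    # Discrete stand-in for the marginal likelihood p(D).
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized], evidence


# Hypothetical data: 7 heads observed in 10 coin tosses.
heads, tosses = 7, 10


def bernoulli_likelihood(theta):
    """Likelihood of the observed tosses for heads probability theta."""
    return theta ** heads * (1 - theta) ** (tosses - heads)


thetas = [i / 100 for i in range(1, 100)]   # grid over (0, 1)
prior = [1 / len(thetas)] * len(thetas)     # uniform prior (assumption)

posterior, evidence = grid_posterior(thetas, prior, bernoulli_likelihood)
theta_map = thetas[posterior.index(max(posterior))]  # grid point with highest posterior
```

Because the normalizer is a single number shared by every grid point, it rescales the posterior without changing which value of \(\theta\) is most probable.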

This chapter covers the following parts:

  • summarizing the posterior
  • posterior computation with conjugate priors
  • Bayesian model comparison
  • approximating the posterior

summarizing the posterior

point estimates

e.g., posterior mean, posterior median

credible interval

Credible intervals measure confidence in our parameter estimates. (A small sample size or low-quality data may lead to larger uncertainty.)

We use a \(100(1-\alpha)\%\) credible interval \((l, u)\), which contains \(1-\alpha\) of the posterior probability mass:

\[P(l \le \theta \le u \mid \mathcal{D})=1-\alpha \]
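The summaries above can be sketched from posterior samples as follows. This is a minimal illustration, assuming the posterior is a Beta(8, 4) distribution that we can sample from; the function name `summarize_posterior` is hypothetical. It uses the central interval, which puts \(\alpha/2\) of the mass in each tail; this is one common choice among several that satisfy the definition.

```python
import random
import statistics


def summarize_posterior(samples, alpha=0.05):
    """Posterior mean, median and central 100(1-alpha)% credible interval."""
    ordered = sorted(samples)
    n = len(ordered)
    # Empirical alpha/2 and 1-alpha/2 quantiles of the samples.
    lower = ordered[int((alpha / 2) * n)]
    upper = ordered[min(n - 1, int((1 - alpha / 2) * n))]
    return {
        "mean": sum(ordered) / n,
        "median": statistics.median(ordered),
        "interval": (lower, upper),
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Assumed posterior for illustration: Beta(8, 4).
samples = [rng.betavariate(8, 4) for _ in range(20000)]
summary = summarize_posterior(samples)
```

With fewer observations the posterior is wider, so the same code returns a longer interval, which matches the remark about small samples.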

posterior computation with conjugate prior

conjugate prior

In this section, we consider a set of (prior, likelihood) pairs for which we can compute the posterior in closed form.

That is, we restrict attention to the cases where the posterior has an explicit solution.

In particular, we will use priors that are "conjugate" to the likelihood.

We say that a prior \(p(\boldsymbol{\theta}) \in \mathcal{F}\) is a conjugate prior for a likelihood function \(p(\mathcal{D} \mid \boldsymbol{\theta})\) if the posterior is in the same parameterized family as the prior, i.e., \(p(\boldsymbol{\theta} \mid \mathcal{D}) \in \mathcal{F}\).

In short, "conjugate" means that the prior and the posterior belong to the same family of distributions.

In other words, \(\mathcal{F}\) is closed under Bayesian updating: starting from a prior in \(\mathcal{F}\), observing data always yields a posterior in \(\mathcal{F}\). If the family \(\mathcal{F}\) corresponds to the exponential family, then the computations can be performed in closed form.
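As a small worked instance of this closure property, consider a coin with unknown heads probability \(\theta\) that lands heads \(N_1\) times and tails \(N_0\) times. This example and the symbols \(\breve{a}, \breve{b}, N_1, N_0\) are illustrative assumptions rather than part of the original notes; it is the two-category special case of the Dirichlet-multinomial model. With a \(\operatorname{Beta}(\theta \mid \breve{a}, \breve{b})\) prior,

\[p(\theta \mid \mathcal{D}) \propto \underbrace{\theta^{N_{1}}(1-\theta)^{N_{0}}}_{\text{likelihood}} \, \underbrace{\theta^{\breve{a}-1}(1-\theta)^{\breve{b}-1}}_{\text{prior}}=\theta^{\breve{a}+N_{1}-1}(1-\theta)^{\breve{b}+N_{0}-1} \propto \operatorname{Beta}\left(\theta \mid \breve{a}+N_{1}, \breve{b}+N_{0}\right) \]

so the posterior stays in the beta family, and updating amounts to adding the observed counts to the hyperparameters.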

The Dirichlet-multinomial model

  • Likelihood:

Let \(Y \sim \operatorname{Cat}(\boldsymbol{\theta})\) be a discrete random variable drawn from a categorical distribution. The likelihood has the form

\[p(\mathcal{D} \mid \boldsymbol{\theta})=\prod_{n=1}^{N} \operatorname{Cat}\left(y_{n} \mid \boldsymbol{\theta}\right)=\prod_{n=1}^{N} \prod_{c=1}^{C} \theta_{c}^{\mathbb{I}\left(y_{n}=c\right)}=\prod_{c=1}^{C} \theta_{c}^{N_{c}} \]

where \(N_{c}=\sum_{n} \mathbb{I}\left(y_{n}=c\right)\) is the number of times category \(c\) appears in the data.

e.g., for a six-sided die (\(C=6\)) rolled \(N\) times, \(N_c\) is the number of rolls in which face \(c\) comes up.

  • Prior:

The conjugate prior for a categorical distribution is the Dirichlet distribution, which is a multivariate generalization of the beta distribution. This has support over the probability simplex, defined by

\[S_{K}=\left\{\boldsymbol{\theta}: 0 \leq \theta_{k} \leq 1, \sum_{k=1}^{K} \theta_{k}=1\right\} \]

The pdf of the Dirichlet is defined as follows:

\[\operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \triangleq \frac{1}{B(\breve{\boldsymbol{\alpha}})} \prod_{k=1}^{K} \theta_{k}^{\breve{\alpha}_{k}-1} \mathbb{I}\left(\boldsymbol{\theta} \in S_{K}\right) \]

where \(B(\breve{\boldsymbol{\alpha}})\) is the multivariate beta function,

\[B(\breve{\boldsymbol{\alpha}}) \triangleq \frac{\prod_{k=1}^{K} \Gamma\left(\breve{\alpha}_{k}\right)}{\Gamma\left(\sum_{k=1}^{K} \breve{\alpha}_{k}\right)} \]

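\(\breve{\boldsymbol{\alpha}}\) is the hyperparameter: it is fixed in advance and is used to infer the posterior distribution of the parameters.

The Dirichlet distribution generalizes the beta distribution: the beta is a distribution over a single probability, while the Dirichlet is a distribution over a whole vector of probabilities that sum to one.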

  • Posterior:

We can combine the multinomial likelihood and Dirichlet prior to compute the posterior, as follows:

\[\begin{aligned} p(\boldsymbol{\theta} \mid \mathcal{D}) & \propto p(\mathcal{D} \mid \boldsymbol{\theta}) \operatorname{Dir}(\boldsymbol{\theta} \mid \breve{\boldsymbol{\alpha}}) \\ & \propto \left[\prod_{k} \theta_{k}^{N_{k}}\right]\left[\prod_{k} \theta_{k}^{\breve{\alpha}_{k}-1}\right] \\ & \propto \operatorname{Dir}\left(\boldsymbol{\theta} \mid \breve{\alpha}_{1}+N_{1}, \ldots, \breve{\alpha}_{K}+N_{K}\right) \\ &=\operatorname{Dir}(\boldsymbol{\theta} \mid \widehat{\boldsymbol{\alpha}}) \end{aligned} \]

where \(\widehat{\alpha}_{k}=\breve{\alpha}_{k}+N_{k}\) are the parameters of the posterior.
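The update rule above is just "add the counts to the prior parameters", which a few lines of code can illustrate. This is a minimal sketch: the die rolls, the uniform prior \(\breve{\alpha}_k = 1\), and the function names are hypothetical choices made for illustration. The posterior mean uses the standard Dirichlet result \(\mathbb{E}[\theta_k] = \alpha_k / \sum_j \alpha_j\).

```python
from collections import Counter


def dirichlet_posterior(prior_alpha, observations):
    """Return posterior Dirichlet parameters: prior alpha_k plus count N_k.

    Categories are labelled 1..K, matching the faces of a die.
    """
    counts = Counter(observations)
    return [a + counts.get(k, 0) for k, a in enumerate(prior_alpha, start=1)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_k / sum_j alpha_j."""
    total = sum(alpha)
    return [a / total for a in alpha]


rolls = [1, 3, 3, 6, 2, 3, 6, 6, 5, 3]   # hypothetical die rolls (N = 10)
prior_alpha = [1.0] * 6                  # uniform Dirichlet prior (assumption)

post_alpha = dirichlet_posterior(prior_alpha, rolls)
post_mean = dirichlet_mean(post_alpha)
```

Face 4 never appears in these rolls, yet its posterior mean is not zero, because the prior contributes one pseudo-count to every face.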

The Gaussian-Gaussian model

We only discuss the distribution of \(\sigma^{2}\) when \(\mu\) is known.

Likelihood:

If \(\mu\) is a known constant, the likelihood for \(\sigma^{2}\) has the form

\[p\left(\mathcal{D} \mid \sigma^{2}\right) \propto\left(\sigma^{2}\right)^{-N / 2} \exp \left(-\frac{1}{2 \sigma^{2}} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right) \]

Here we can no longer ignore the \(\left(\sigma^{2}\right)^{-N / 2}\) factor in front, because it depends on \(\sigma^{2}\).

Prior:

The standard conjugate prior is the inverse Gamma distribution (the distribution of \(1/X\) when \(X\) is Gamma distributed), given by

\[\mathrm{IG}\left(\sigma^{2} \mid \breve{a}, \breve{b}\right)=\frac{\breve{b}^{\breve{a}}}{\Gamma(\breve{a})}\left(\sigma^{2}\right)^{-(\breve{a}+1)} \exp \left(-\frac{\breve{b}}{\sigma^{2}}\right) \]

Posterior:

\[\begin{aligned} p\left(\sigma^{2} \mid \mathcal{D}\right) & \propto p\left(\mathcal{D} \mid \sigma^{2}\right) \mathrm{IG}\left(\sigma^{2} \mid \breve{a}, \breve{b}\right) \\ & \propto \left(\sigma^{2}\right)^{-\left(\frac{N}{2}+\breve{a}+1\right)} \exp \left(-\frac{\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}}{\sigma^{2}}\right) \end{aligned} \]

i.e., the kernel of \(\mathrm{IG}\left(\sigma^{2} \mid \frac{N}{2}+\breve{a}, \breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2}\right)\).

Multiplying the likelihood and the prior, we see that the posterior is also IG:

\[\begin{aligned} p\left(\sigma^{2} \mid \mu, \mathcal{D}\right) &=\mathrm{IG}\left(\sigma^{2} \mid \widehat{a}, \widehat{b}\right) \\ \widehat{a} &=\breve{a}+N / 2 \\ \widehat{b} &=\breve{b}+\frac{1}{2} \sum_{n=1}^{N}\left(y_{n}-\mu\right)^{2} \end{aligned} \]
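The inverse Gamma update can be sketched numerically as below. The data values, the prior hyperparameters \(\breve{a}=2, \breve{b}=1\), the known mean \(\mu=0\), and the function names are hypothetical and chosen only for illustration. The posterior mean uses the standard result \(\mathbb{E}[\sigma^2] = b/(a-1)\) for \(a>1\) under the shape/scale parameterization given above.

```python
def inverse_gamma_posterior(a_prior, b_prior, data, mu):
    """Posterior IG(a_hat, b_hat) for the variance when the mean mu is known."""
    n = len(data)
    sum_sq = sum((y - mu) ** 2 for y in data)
    a_hat = a_prior + n / 2          # a_hat = a + N/2
    b_hat = b_prior + 0.5 * sum_sq   # b_hat = b + (1/2) * sum_n (y_n - mu)^2
    return a_hat, b_hat


def inverse_gamma_mean(a, b):
    """Mean of IG(a, b); it only exists for a > 1."""
    if a <= 1:
        raise ValueError("the inverse Gamma mean is undefined for a <= 1")
    return b / (a - 1)


data = [1.2, -0.7, 0.3, 2.1, -1.4, 0.5]   # hypothetical observations
a_hat, b_hat = inverse_gamma_posterior(2.0, 1.0, data, mu=0.0)
sigma2_mean = inverse_gamma_mean(a_hat, b_hat)
```

Each observation adds one half to the shape parameter and half of its squared deviation from \(\mu\) to the scale parameter, so the prior's influence shrinks as \(N\) grows.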

In general, the posterior has no closed form, so we have to use approximate inference methods.

Bayesian Model comparison

"All models are wrong, but some are useful." (George Box)

We assume we have a set of models \(\mathcal{M}\).

Objective: we want to choose the best model from the set \(\mathcal{M}\).

\[\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}} p(m \mid \mathcal{D}) \]

where

\[p(m \mid \mathcal{D})=\frac{p(\mathcal{D} \mid m) p(m)}{\sum_{m \in \mathcal{M}} p(\mathcal{D} \mid m) p(m)} \]

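\(m\): model

\(\mathcal{D}\): data

\(p(\mathcal{D} \mid m)\): the marginal likelihood of the data once model \(m\) is fixed. It plays the same role as \(p(\mathcal{D})\) in \(p(\boldsymbol{\theta} \mid \mathcal{D})=\frac{p(\boldsymbol{\theta}) p(\mathcal{D} \mid \boldsymbol{\theta})}{p(\mathcal{D})}\).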

If the prior over models is uniform, \(p(m)=1 /|\mathcal{M}|\), that is, with no extra information we treat every model as equally likely, then the MAP model is given by

\[\hat{m}=\underset{m \in \mathcal{M}}{\operatorname{argmax}} p(\mathcal{D} \mid m) \]
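A minimal sketch of this model-selection step is given below. The model names and log marginal likelihood values are made up for illustration, and `model_posterior` is a hypothetical helper. Working with logarithms and subtracting the maximum before exponentiating (the log-sum-exp trick) is a numerical precaution added here, since marginal likelihoods are often extremely small numbers.

```python
import math


def model_posterior(log_evidence, prior=None):
    """Compute p(m | D) from log p(D | m) and p(m) for a finite set of models."""
    names = list(log_evidence)
    if prior is None:
        # Uniform prior over models: p(m) = 1 / |M|.
        prior = {m: 1 / len(names) for m in names}
    log_joint = {m: log_evidence[m] + math.log(prior[m]) for m in names}
    # Subtract the maximum before exponentiating to avoid underflow.
    shift = max(log_joint.values())
    total = sum(math.exp(v - shift) for v in log_joint.values())
    return {m: math.exp(v - shift) / total for m, v in log_joint.items()}


# Hypothetical log marginal likelihoods log p(D | m) for three models.
log_evidence = {"m1": -71.7, "m2": -45.8, "m3": -47.0}

posterior = model_posterior(log_evidence)
map_model = max(posterior, key=posterior.get)
```

Under the uniform prior, `map_model` is the model with the largest marginal likelihood; passing a non-uniform `prior` dictionary can change the ranking.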

The quantity \(p(\mathcal{D} \mid m)\) is given by

\[p(\mathcal{D} \mid m)=\int p(\mathcal{D} \mid \boldsymbol{\theta}, m) p(\boldsymbol{\theta} \mid m) d \boldsymbol{\theta} \]
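For conjugate models this integral has a closed form. As a sketch, for the categorical likelihood \(\prod_c \theta_c^{N_c}\) with a Dirichlet prior, integrating out \(\boldsymbol{\theta}\) gives \(p(\mathcal{D} \mid m)=B(\breve{\boldsymbol{\alpha}}+\mathbf{N}) / B(\breve{\boldsymbol{\alpha}})\), where \(\mathbf{N}\) is the vector of counts. The code below compares that model with a model that fixes the die to be fair and so has no free parameters. The counts and function names are hypothetical.

```python
import math


def log_multivariate_beta(alpha):
    """log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))


def log_evidence_dirichlet(counts, alpha):
    """log p(D | m) for a categorical likelihood with a Dirichlet(alpha) prior."""
    posterior_alpha = [a + n for a, n in zip(alpha, counts)]
    return log_multivariate_beta(posterior_alpha) - log_multivariate_beta(alpha)


def log_evidence_fixed(counts, theta):
    """log p(D | m) when theta is fixed by the model (nothing to integrate)."""
    return sum(n * math.log(t) for n, t in zip(counts, theta) if n > 0)


counts = [2, 1, 30, 2, 3, 2]   # hypothetical counts N_c from 40 rolls of a die

log_ev_fair = log_evidence_fixed(counts, [1 / 6] * 6)
log_ev_dirichlet = log_evidence_dirichlet(counts, [1.0] * 6)
```

For these strongly skewed counts the Dirichlet model has the higher marginal likelihood, so it is preferred under a uniform prior over the two models.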

Bayes model averaging

If our goal is to perform prediction, we can get better results if we marginalize out over all models, by computing

\[p(y \mid \mathbf{x}, \mathcal{D})=\sum_{m \in \mathcal{M}} p(y \mid \mathbf{x}, m) p(m \mid \mathcal{D}) \]

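\(\mathcal{D}\): training data; \((\mathbf{x}, y)\): a new data point; goal: predict \(y\).

Disadvantage: computationally very expensive, since every model in \(\mathcal{M}\) must be fitted and evaluated.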
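The averaging formula can be sketched for a discrete label \(y\) as follows. The posterior model probabilities, the per-model predictive distributions, and the labels "rain" and "dry" are invented for illustration, and `bma_predict` is a hypothetical helper name.

```python
def bma_predict(model_probs, predictive):
    """Bayes model averaging: p(y | x, D) = sum_m p(y | x, m) p(m | D)."""
    labels = list(next(iter(predictive.values())))
    return {
        y: sum(model_probs[m] * predictive[m][y] for m in model_probs)
        for y in labels
    }


# Hypothetical posterior model probabilities p(m | D).
model_probs = {"m1": 0.7, "m2": 0.2, "m3": 0.1}

# Hypothetical predictive distributions p(y | x, m) for one new input x.
predictive = {
    "m1": {"rain": 0.9, "dry": 0.1},
    "m2": {"rain": 0.4, "dry": 0.6},
    "m3": {"rain": 0.2, "dry": 0.8},
}

averaged = bma_predict(model_probs, predictive)
```

Selecting only the MAP model would report the prediction of `m1` alone; the averaged prediction is less extreme because the other models still carry some posterior weight.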

week3


Original article: https://www.cnblogs.com/YuzifeiYu/p/14508016.html

