（main）贝叶斯统计 | 贝叶斯定理 | 贝叶斯推断 | 贝叶斯线性回归

2019年08月31日更新

看了一篇发在NM上的文章才又明白了贝叶斯方法的重要性和普适性，结合目前最火的DL，会有意想不到的结果。

目前一些最直觉性的理解：

概率的核心就是可能性空间一定，三体世界不会有概率
贝叶斯的基础就是条件概率，条件概率的核心就是可能性空间的缩小，获取了新的信息就是个可能性空间缩小的过程
贝叶斯定理的核心就是，先验*似然=后验，有张图可以完美可视化这个定理
只要我们能得到可靠的先验或似然，任意一个，我们就能得到更可靠的后验概率

最近又在刷一个Coursera的课程：Bayesian Statistics: From Concept to Data Analysis，希望能更系统地学习一下。

week1

In this module, we review the basics of probability and Bayes’ theorem.

In Lesson 1, we introduce the different paradigms or definitions of probability and discuss why probability provides a coherent framework for dealing with uncertainty.
In Lesson 2, we review the rules of conditional probability and introduce Bayes’ theorem.
Lesson 3 reviews common probability distributions for discrete and continuous random variables.

基本问题：

概率的不同定义，Classical framework，Frequentist framework，Bayesian framework
什么是条件概率
使用贝叶斯定理计算条件概率，一个罕见病的案例
理解常见的概率分布，能写出其期望、方差、PDF、PMF，很有价值的总结
计算常见分布的概率结果
理解中心极限定理central limit theorem，抽样分布：with sufficiently large sample sizes, the sample average approximately follows a normal distribution. 确定了正态分布的核心地位。
Bayesian and Frequentist在哲学上的区别，客观与主观、决定论与信息论

其他：

Probability和Odds的区别
complement是c的缩写
随机变量的期望和方差
理解indicator functions，设置定义域
continuous version of Bayes’ theorem，the sum gets replaced with an integral，对所有情况下的θ进行积分。

week2

This module introduces concepts of statistical inference from both frequentist and Bayesian perspectives.

Lesson 4 takes the frequentist view, demonstrating maximum likelihood estimation and confidence intervals for binomial data.
Lesson 5 introduces the fundamentals of Bayesian inference. Beginning with a binomial likelihood and prior probabilities for simple hypotheses, you will learn how to use Bayes’ theorem to update the prior with data to obtain posterior probabilities. This framework is extended with the continuous version of Bayes theorem to estimate continuous model parameters, and calculate posterior probabilities and credible intervals.

基本问题：

什么是似然函数，之前写过类似文章，其实似然函数和条件概率函数是一个东西，只是given的变量不一样，一个是参数，一个是事件。最直观的差别就是条件概率之和为1，而似然则不是，它比较的是不同参数出现的似然值得大小，绝对不能说是参数出现的概率。
什么是频率学派的CI置信域？以抛硬币为例，我们有一个观测值，然后根据置信域我们可以得出95%的CI，解释就是We're 95% confident that the true probability of getting a head is in this interval. Each time we create a confidence interval in this way based on the data we observe. Than on average 95% of the intervals we make will contain the true value of p. 因为我们假设我们得到的结果不是小概率事件，所以我们重复很多次会发现95%的都会包含真值。Does this interval contain the true p. What's the probability that this interval contains a true p? Well, we don't know for this particular interval. 但是我们无法回答某个特定的置信域的问题，因此我们需要贝叶斯置信域。
计算伯努利分布和二项分布的置信域
MLE最大似然估计的应用，对似然函数求导，得到其最大值。MLE属于点估计，可以用中心极限定理来求CI。
贝叶斯后验区间，there is probably a p is in this interval is 95% based on a random interpretation of an unknown parameter

统计分布没有派别，它是关于数据的概率描述

likelihood <- function(n,y,theta){

  return(theta^y*(1-theta)^(n-y))

}

theta <- seq(from=0.01, to=0.99, by=0.01)

plot(theta, likelihood(400,72,theta))

似然值的计算和普通概率的计算差不多　

其他：

Argmax
Cumulative distribution function (CDF)，离散和连续分布都有，定义为小于一定值得概率，最大值为1，单调递增，因为事件概率不能为负。
Probability density function (PDF)，只有连续才有，积分后就得到了CDF
Probability mass function (PMF)，只有离散才有，就是简单的单点概率
R中多个概率分布的差别，dnorm：PDF，pnorm：CDF，qnorm： quantile function，rnorm：pseudo-random samples

R代码练习：

题1：X ∼ Binomial(5, 0.6)，如何求F(1)? 为了避免求积分，可以直接用CDF，也就是p开头的函数来求，pbinom(1, 5, 0.6)，直接算出了P(X ≤ 1)的CDF。另外也可以用qbinom来验证， qbinom(p=0.087, size=5, prob=0.6)，这是求CDF为p的情况下，x的近似值。

题2：Y ∼ Exp(1)，求CDF分别为0.1和0.9的Y值，直接拿qexp函数来求，qexp(0.1, rate = 1)，其中0.1也可以换成vector，批量求。

1. Let X ∼ Pois(3). Find P(X = 1). (0.149): dpois(1, lambda = 3)
2. Let X ∼ Pois(3). Find P(X ≤ 1). (0.199): ppois(1, lambda = 3)
3. Let X ∼ Pois(3). Find P(X > 1). (0.801): 1-0.199
4. Let Y ∼ Gamma(2, 1/3). Find P(0.5 < Y < 1.5). (0.078): pgamma(1.5, shape = 2, rate = 1/3) - pgamma(0.5, shape = 2, rate = 1/3)
5. Let Z ∼ N(0, 1). Find z such that P(Z < z) = 0.975. (1.96): qnorm(0.975, mean = 0, sd = 1)
6. Let Z ∼ N(0, 1). Find P(−1.96 < Z < 1.96). (0.95): pnorm(1.96, mean = 0, sd = 1) - pnorm(-1.96, mean = 0, sd = 1)
7. Let Z ∼ N(0, 1). Find z such that P(−z < Z < z) = 0.90. (1.64): qnorm(0.05, mean = 0, sd = 1)

总结：

离散：单点概率用d函数。

连续：单点无概率，所以d函数返回的是该点在PDF上的高度/值。

p函数都是给定一个值得到小于该值得F函数值，也就是概率；q函数则是给定一个概率得到其对应的一个F函数对应的x值，是p函数的逆函数。

r函数则是随机从指定分布中抽取n的数值出来。

参考：Introduction to dnorm, pnorm, qnorm, and rnorm for new biostatisticians

贝叶斯推断

从先验到后验的更新过程
频率派和贝叶斯派的统计推断的差别
贝叶斯推断在离散型和连续型数据上的应用
为什么说贝叶斯定理分母下面的是normlizing constant？因为它最初的起源就是f(y)或p(y)，与θ无关，表示我们观测数据出现的概率，然后后面为了计算方便，才用全概率公司展开。
f(θ|x) ∝ f(x|θ)f(θ)，The symbol ∝ stands for “is proportional to.”

week3

In this module, you will learn methods for selecting prior distributions and building models for discrete data.

Lesson 6 introduces prior selection and predictive distributions as a means of evaluating priors.
Lesson 7 demonstrates Bayesian analysis of Bernoulli data and introduces the computationally convenient concept of conjugate priors.
Lesson 8 builds a conjugate model for Poisson data and discusses strategies for selection of prior hyperparameters.

基本问题：

Understand the prior as representing information.
Understand the concept of conjugate priors.
Recognize the posterior mean as a weighted average of the prior mean and the data estimates, and understand the concept of an effective sample size of a prior.
Compute posterior probabilities for Bernoulli, binomial, and Poisson likelihoods.

学会公式推导，When we use a uniform prior for a Bernoulli likelihood, we get a beta posterior. lesson 7.1. In fact, the uniform distribution, is a beta one one.

什么是共轭分布？在某个分布的likelihood的作用下，prior和posterior同分布，则我们称这两个分布共轭。共轭具有非常好的数学性质，完美的符合了我们的先验后验彼此交替的需求。And any beta distribution, is conjugate for the Bernoulli distribution. Any beta prior, will give a beta posterior.

理解gamma分布和beta分布？gamma可以理解为阶乘函数，beta是由gamma组合而来。

如何选择合适的先验概率？poisson的likelihood需要设置gamma的prior，会得到gamma的posterior。

Posterior mean and effective sample size，关于beta分布的两个计算问题，以及beta后验是如何更新的。可以看到一个很有用的权重：posterior mean = prior weight * prior mean + data weight + data mean

This effective sample size also gives you an idea of how much data you would need to make sure that you're prior doesn't have much influence on your posterior.

贝叶斯的一个最好的应用：Medical devices, you often have very small sample sizes. But you're only making minor updates to the devices and you're doing new trials. The ability of Bayesian statistics to do easy sequential updates made it very practical and appealing For the medical device testing industry.

能熟练用R来求解伯努利、二项分布和beta分布的问题：

这里需要很好的区分两个概率，第一个概率就是二项分布的概率X，这虽然是个概率，但可以看做是一个随机变量；第二个就是这个随机变量X发生的概率。

理解这里画的图的含义，不是传统的概率分布图，而是随机变量的似然值，dbeta就是随机变量取特定值时的PDF上的值。这里画了不同prior下，随机变量的不同似然分布。

需要加深前面的关于p函数的理解，p函数就是用求F函数的某个值下的概率，就是小于等于某个值的概率，最小为0，最大为1. 这里的pbeta就很好理解了。

二项分布作为likelihood来更新beta函数则十分简单，直接对beta分布的两个参数进行加减即可。

如何把先验、likelihood和后验全部画到一个图里？the posterior mean is somewhere in between the maximum likelihood estimate and the prior mean of two-thirds. 其实画到一张图里是不严谨的，需要调整一下scale，核心是要明白数据的集中度发生了变化。

在制作巧克力饼干时，the number of chips per cookie approximately falls a Poisson distribution. 与gamma分布共轭。gamma的期望和方差一定要知道。

# Suppose we are giving two students a multiple-choice exam with 40 questions,

# where each question has four choices. We don't know how much the students

# have studied for this exam, but we think that they will do better than just

# guessing randomly.

# 1) What are the parameters of interest?

# 2) What is our likelihood?

# 3) What prior should we use?

# 4) What is the prior probability P(theta>.25)? P(theta>.5)? P(theta>.8)?

# 5) Suppose the first student gets 33 questions right. What is the posterior

#    distribution for theta1? P(theta1>.25)? P(theta1>.5)? P(theta1>.8)?

#    What is a 95% posterior credible interval for theta1?

# 6) Suppose the second student gets 24 questions right. What is the posterior

#    distribution for theta2? P(theta2>.25)? P(theta2>.5)? P(theta2>.8)?

#    What is a 95% posterior credible interval for theta2?

# 7) What is the posterior probability that theta1>theta2, i.e., that the

#    first student has a better chance of getting a question right than

#    the second student?

############

# Solutions:

# 1) Parameters of interest are theta1=true probability the first student

#    will answer a question correctly, and theta2=true probability the second

#    student will answer a question correctly.

# 2) Likelihood is Binomial(40, theta), if we assume that each question is

#    independent and that the probability a student gets each question right

#    is the same for all questions for that student.

# 3) The conjugate prior is a beta prior. Plot the density with dbeta.

theta=seq(from=0,to=1,by=.01)

plot(theta,dbeta(theta,1,1),type="l")

plot(theta,dbeta(theta,4,2),type="l")

plot(theta,dbeta(theta,8,4),type="l")

# 4) Find probabilities using the pbeta function.

1-pbeta(.25,8,4)

1-pbeta(.5,8,4)

1-pbeta(.8,8,4)

# 5) Posterior is Beta(8+33,4+40-33) = Beta(41,11)

41/(41+11)  # posterior mean

33/40       # MLE

lines(theta,dbeta(theta,41,11))

# plot posterior first to get the right scale on the y-axis

plot(theta,dbeta(theta,41,11),type="l")

lines(theta,dbeta(theta,8,4),lty=2)

# plot likelihood

lines(theta,dbinom(33,size=40,p=theta),lty=3)

# plot scaled likelihood

lines(theta,44*dbinom(33,size=40,p=theta),lty=3)

# posterior probabilities

1-pbeta(.25,41,11)

1-pbeta(.5,41,11)

1-pbeta(.8,41,11)

# equal-tailed 95% credible interval

qbeta(.025,41,11)

qbeta(.975,41,11)

# 6) Posterior is Beta(8+24,4+40-24) = Beta(32,20)

32/(32+20)  # posterior mean

24/40       # MLE

plot(theta,dbeta(theta,32,20),type="l")

lines(theta,dbeta(theta,8,4),lty=2)

lines(theta,44*dbinom(24,size=40,p=theta),lty=3)

1-pbeta(.25,32,20)

1-pbeta(.5,32,20)

1-pbeta(.8,32,20)

qbeta(.025,32,20)

qbeta(.975,32,20)

# 7) Estimate by simulation: draw 1,000 samples from each and see how often

#    we observe theta1>theta2

theta1=rbeta(1000,41,11)

theta2=rbeta(1000,32,20)

mean(theta1>theta2)

# Note for other distributions:

# dgamma,pgamma,qgamma,rgamma

# dnorm,pnorm,qnorm,rnorm

week4

This module covers conjugate and objective Bayesian analysis for continuous data.

Lesson 9 presents the conjugate model for exponentially distributed data.
Lesson 10 discusses models for normally distributed data, which play a central role in statistics.
In Lesson 11, we return to prior selection and discuss ‘objective’ or ‘non-informative’ priors.
Lesson 12 presents Bayesian linear regression with non-informative priors, which yield results comparable to those of classical regression.

指数分布

For example, suppose you're waiting for a bus that you think comes on average once every ten minutes, but you're not sure exactly how often it comes.

gamma distribution is conjugate for an exponential likelihood. Gammas actually are conjugate for a number of different things.

如何选择合适的prior？

一文包含所有：Probability concepts explained: Bayesian inference for parameter estimation.

贝叶斯并不难，关键是要能熟能生巧，熟练运用在生活各个方面，应用到各个项目。

最近发现遗传领域用贝叶斯实在是太普遍了，不得不再温习一遍。

所谓高手，就是把自己活成了贝叶斯定理 - 他的引入和案例非常好，只是深究的话有些问题。

首先理解条件概率：

P(A|B)是什么意思，P(A)可以简单的用venn图来可视化，就是内圈的面积；P(A|B)就是在限定空间B下，A的概率。举例：假设在一个大公司，每个人都有升迁的概率：P(升)，我想知道拿到MBA后升迁的概率：P(升|MBA)，理论上：我们找出所有拿到MBA的人，再一一统计他们是否升迁就可以得到这个概率了，真要这样那就不用贝叶斯了。

实际上，我们永远只能做抽样估计。贝叶斯公式是对称的，通常是有一边是我们感兴趣的，但是无法求解，所以我们可以曲线救国，求另外一边。还有一个就是全概率公式，这个在venn图中也特别好理解，就是把全集拆成几个互斥的部分，分别求解。

贝叶斯的灵魂就是先验、后验和调整因子，如何在实际生活中理解和贯彻这个才是关键。

先验：some knowledge or belief that we already have (commonly known as the prior)，不用太复杂，先验就是指我们已经获得的知识，通常是marginal probability。 P(A) is a prior to me knowing anything about the B. 先验可以是猜测的，可以包含一定的主观因素。更规范一点我们的P(A)不是一个固定值，而是一个分布，prior distribution。

后验：是指我们得到一些新的数据后，我们原猜测发生的概率，相当于是对原先主观先验的一个更新。P(Θ|data) on the left hand side is known as the posterior distribution. This is the distribution representing our belief about the parameter values after we have calculated everything on the right hand side taking the observed data into account.

核心：Therefore we can calculate the posterior distribution of our parameters using our prior beliefs updated with our likelihood.

共轭和传递性

贝叶斯推断及其互联网应用（一）：定理简介

作为一个非统计专业的人，着实是被贝叶斯思想折磨了很久，常见的公式都能倒背如流，但依旧无法理解其精神内核。

近日，有高人指点，自己再查了点资料，才对贝叶斯思想有所领悟。。。

基本框架：前面总结了常见分布的概念，这里贝叶斯也不例外，都是概率论，概率研究的核心就是随机事件发生的概率。以后遇到统计时，要习惯“某事件发生概率”这种专业说法。

例子1：

两个一模一样的碗，一号碗有30颗水果糖和10颗巧克力糖，二号碗有水果糖和巧克力糖各20颗。现在随机选择一个碗，从中摸出一颗糖，发现是水果糖。请问这颗水果糖来自一号碗的概率有多大？

作者的描述有点含糊，这里会修正一下：

我们假定，H1表示摸出的球来自一号碗，H2表示摸出的球来自二号碗。由于我们假定这两个碗是一样的（先验概率已被指定），所以P(H1)=P(H2)，也就是说，在取出水果糖之前，这两个碗被选中的概率相同。因此，P(H1)=0.5，我们把这个概率就叫做"先验概率"，即没有做实验之前，来自一号碗的概率是0.5。

再假定，E表示取出的是水果糖，所以问题就变成了在已知E的情况下，来自一号碗的概率有多大，即求P(H1|E)。我们把这个概率叫做"后验概率"，即在E事件发生之后，对P(H1)的修正。

后面计算就不说了，老生常谈，主要是把概念规范化，不要把数学语言和日常用于混淆在一起。

例子2：

已知某种疾病的发病率是0.001，即1000人中会有1个人得病。现有一种试剂可以检验患者是否得病，它的准确率是0.99，即在患者确实得病的情况下，它有99%的可能呈现阳性。它的误报率是5%，即在患者没有得病的情况下，它有5%的可能呈现阳性。现有一个病人的检验结果为阳性，请问他确实得病的可能性有多大？

假定A事件表示得病，那么P(A)为0.001。这就是"先验概率"，即没有做试验之前，我们预计的发病率。再假定B事件表示阳性，那么要计算的就是P(A|B)。这就是"后验概率"，即做了试验以后，对发病率的估计。

其实在医学统计学里，99%不叫作准确率，而是sensitivity。

5%也不叫作误报率，而叫做假阳性率，与之对应的是specificity。

计算过程可以参照原文。

参考：Precision and recall

实例：垃圾邮件过滤

贝叶斯过滤器是一种统计学过滤器，建立在已有的统计结果之上。所以，我们必须预先提供两组已经识别好的邮件，一组是正常邮件，另一组是垃圾邮件。

我们用这两组邮件，对过滤器进行"训练"。这两组邮件的规模越大，训练效果就越好。Paul Graham使用的邮件规模，是正常邮件和垃圾邮件各4000封。

"训练"过程很简单。首先，解析所有邮件，提取每一个词。然后，计算每个词语在正常邮件和垃圾邮件中的出现频率。比如，我们假定"sex"这个词，在4000封垃圾邮件中，有200封包含这个词，那么它的出现频率就是5%；而在4000封正常邮件中，只有2封包含这个词，那么出现频率就是0.05%。（【注释】如果某个词只出现在垃圾邮件中，Paul Graham就假定，它在正常邮件的出现频率是1%，反之亦然。这样做是为了避免概率为0。随着邮件数量的增加，计算结果会自动调整。）

有了这个初步的统计结果，过滤器就可以投入使用了。

Github上有这个的代码，可以去跑一跑。

前面已经说了贝叶斯是一种思想，它可以被用在任何统计模型上。这也就是为什么你能听到各种贝叶斯相关的术语：贝叶斯线性回归、贝叶斯广义线性回归等等。

接下来就从最简单的贝叶斯线性回归为例，来讲解贝叶斯思想是如何与传统统计模型相结合的。

参考：贝叶斯线性回归（Bayesian Linear Regression）

如何通俗地解释贝叶斯线性回归的基本原理？

从贝叶斯的观点看线性分类和线性回归