An Intuitive Understanding of Bayes | Bayesian Statistics | Bayes' Theorem | Bayesian Inference | Bayesian Linear Regression
Bayes theorem, the geometry of changing beliefs
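- The core of probability is a fixed space of possibilities; in the world of The Three-Body Problem, where that space is never fixed, there would be no probability
- Bayes rests on conditional probability, and the essence of conditional probability is a shrinking space of possibilities: obtaining new information is the process of narrowing that space
- The core of Bayes' theorem is posterior ∝ prior × likelihood (equal once divided by the normalizing constant); there is a diagram that visualizes the theorem perfectly
- As long as we can obtain a reliable prior or a reliable likelihood, either one, we get a more reliable posterior probability
Recently I have been working through another Coursera course, Bayesian Statistics: From Concept to Data Analysis, hoping to learn the subject more systematically.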
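The prior-times-likelihood idea in the bullets above can be sketched numerically. This is a minimal illustration, not something taken from the course: the two hypotheses, their prior weights, and the data (4 heads in 5 flips) are all assumptions.

```r
# Hypothetical setup: two candidate hypotheses about a coin.
theta <- c(fair = 0.5, loaded = 0.7)   # assumed P(heads) under each hypothesis
prior <- c(fair = 0.5, loaded = 0.5)   # assumed prior beliefs, summing to 1

# Assumed data: 5 flips, 4 heads. Likelihood of the data under each hypothesis.
likelihood <- dbinom(4, size = 5, prob = theta)

# prior * likelihood is the unnormalized posterior;
# dividing by its sum rescales it into a probability distribution.
unnormalized <- prior * likelihood
posterior <- unnormalized / sum(unnormalized)

print(round(posterior, 3))
```

The data favor the "loaded" hypothesis, so its posterior weight rises above its prior weight while the two weights still sum to 1.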
In this module, we review the basics of probability and Bayes’ theorem.
- In Lesson 1, we introduce the different paradigms or definitions of probability and discuss why probability provides a coherent framework for dealing with uncertainty.
- In Lesson 2, we review the rules of conditional probability and introduce Bayes’ theorem.
- Lesson 3 reviews common probability distributions for discrete and continuous random variables.
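- Different definitions of probability: the Classical framework, the Frequentist framework, and the Bayesian framework
- What conditional probability is
- Using Bayes' theorem to compute conditional probabilities, with a rare-disease case study
- Understanding the common probability distributions and being able to write down their expectation, variance, PDF, and PMF; a very valuable summary
- Computing probabilities for common distributions
- Understanding the central limit theorem and sampling distributions: with sufficiently large sample sizes, the sample average approximately follows a normal distribution. This establishes the central role of the normal distribution.
- The philosophical difference between Bayesian and Frequentist: objective versus subjective, determinism versus information theory
- The difference between probability and odds
- c stands for complement (as in A^c)
- Expectation and variance of a random variable
- Understanding indicator functions, which are used to specify the domain (support) of a distribution
- The continuous version of Bayes' theorem: the sum gets replaced with an integral, taken over all possible values of θ.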
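The rare-disease calculation and the probability-versus-odds distinction from the list above can be sketched in a few lines of R. The prevalence, sensitivity, and false-positive rate below are assumptions chosen for illustration, not the course's numbers.

```r
# All three numbers below are assumptions chosen for illustration.
prevalence     <- 0.001   # P(disease): the prior
sensitivity    <- 0.95    # P(positive | disease)
false_positive <- 0.05    # P(positive | no disease)

# Law of total probability: P(positive) summed over both disease states.
p_positive <- sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes' theorem: P(disease | positive).
p_disease_given_positive <- sensitivity * prevalence / p_positive

# Odds are p / (1 - p); for small p they are close to the probability itself.
odds <- p_disease_given_positive / (1 - p_disease_given_positive)

print(p_disease_given_positive)
print(odds)
```

Under these assumed numbers a positive test still leaves the disease unlikely, because the prior is so small that false positives outnumber true positives.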
This module introduces concepts of statistical inference from both frequentist and Bayesian perspectives.
- Lesson 4 takes the frequentist view, demonstrating maximum likelihood estimation and confidence intervals for binomial data.
- Lesson 5 introduces the fundamentals of Bayesian inference. Beginning with a binomial likelihood and prior probabilities for simple hypotheses, you will learn how to use Bayes’ theorem to update the prior with data to obtain posterior probabilities. This framework is extended with the continuous version of Bayes theorem to estimate continuous model parameters, and calculate posterior probabilities and credible intervals.
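- What is a likelihood function? I wrote a similar post before. The likelihood function and the conditional probability function are the same expression; what differs is which quantity is held fixed: as a probability it is a function of the event with the parameter fixed, and as a likelihood it is a function of the parameter with the observed data fixed. The most visible difference is that conditional probabilities sum to 1, while likelihood values need not. A likelihood compares how well different parameter values explain the data and must never be read as the probability of the parameter.
- What is the frequentist confidence interval (CI)? Take coin flipping: we have an observation, and from it we obtain a 95% CI, interpreted as: We're 95% confident that the true probability of getting a head is in this interval. Each time we create a confidence interval in this way based on the data we observe, then on average 95% of the intervals we make will contain the true value of p. The 95% describes the procedure: if we repeat the experiment many times, about 95% of the resulting intervals will contain the true value. Does this interval contain the true p? What's the probability that this interval contains the true p? Well, we don't know for this particular interval. We cannot answer the question for any one specific interval, which is why we need the Bayesian credible interval.
- Computing confidence intervals for the Bernoulli and binomial distributions
- Applying maximum likelihood estimation (MLE): differentiate the (log-)likelihood and set the derivative to zero to find its maximum. The MLE is a point estimate; the central limit theorem can be used to build a CI around it.
- The Bayesian posterior interval: the probability that p is in this interval is 95%, based on a random interpretation of an unknown parameter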
```r
likelihood <- function(n, y, theta) {
  return(theta^y * (1 - theta)^(n - y))
}
theta <- seq(from = 0.01, to = 0.99, by = 0.01)
plot(theta, likelihood(400, 72, theta))
```
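As a rough sketch of the MLE and CLT-based confidence interval described above, the following reuses the same n = 400 and y = 72 as the likelihood plot. The variable names are illustrative, and the numerical search via `optimize` is just one way to find the argmax; the closed form is y / n.

```r
n <- 400
y <- 72

# Log-likelihood of theta for y successes in n Bernoulli trials
# (the binomial coefficient is dropped because it does not depend on theta).
log_likelihood <- function(theta) y * log(theta) + (n - y) * log(1 - theta)

# Numerical argmax; setting the derivative to zero gives y / n in closed form.
mle <- optimize(log_likelihood, interval = c(0.001, 0.999), maximum = TRUE)$maximum

# Approximate 95% confidence interval from the central limit theorem.
se <- sqrt(mle * (1 - mle) / n)
ci <- mle + c(-1, 1) * qnorm(0.975) * se

# The likelihood is not a density in theta: its integral over (0, 1)
# is the beta function B(y + 1, n - y + 1), which is far below 1.
area <- beta(y + 1, n - y + 1)

print(mle)
print(ci)
print(area)
```

The tiny value of `area` is a concrete reminder that likelihood values compare parameters but do not form a probability distribution over them.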
- Argmax
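- Cumulative distribution function (CDF): defined for both discrete and continuous distributions as the probability of being less than or equal to a given value; its maximum is 1 and it is monotonically non-decreasing, because probabilities cannot be negative.
- Probability density function (PDF): exists only for continuous distributions; integrating it gives the CDF
- Probability mass function (PMF): exists only for discrete distributions; it is simply the probability at a single point
- How the distribution functions in R differ: dnorm is the PDF, pnorm is the CDF, qnorm is the quantile function, and rnorm draws pseudo-random samples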
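The relationships among the four function families listed above can be checked directly. This is a small sketch for the standard normal; the cut-off 1.96, the seed, and the sample size are arbitrary choices.

```r
# d*: density (PDF). Integrating it up to x should reproduce the CDF.
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value

# p*: CDF, P(Z <= x).
cdf <- pnorm(1.96)

# q*: quantile function, the inverse of the CDF.
x_back <- qnorm(cdf)

# r*: pseudo-random draws; the seed is arbitrary and only fixes the output.
set.seed(1)
draws <- rnorm(10000)
empirical <- mean(draws <= 1.96)

print(c(area = area, cdf = cdf, x_back = x_back, empirical = empirical))
```

The same d/p/q/r prefixes apply to the other distributions in base R, such as binom, pois, exp, gamma, and beta.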
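Question 1: X ∼ Binomial(5, 0.6); how do we find F(1)? Rather than summing the PMF by hand, use the CDF directly, that is, the function whose name starts with p: pbinom(1, 5, 0.6) returns P(X ≤ 1). It can be checked with qbinom: qbinom(p=0.087, size=5, prob=0.6) returns the smallest x whose CDF value is at least p.
Question 2: Y ∼ Exp(1); find the values of Y at which the CDF equals 0.1 and 0.9. Use qexp directly: qexp(0.1, rate = 1). The 0.1 can also be replaced by a vector such as c(0.1, 0.9) to compute both at once.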
- 1. Let X ∼ Pois(3). Find P(X = 1). (0.149): dpois(1, lambda = 3)
- 2. Let X ∼ Pois(3). Find P(X ≤ 1). (0.199): ppois(1, lambda = 3)
- 3. Let X ∼ Pois(3). Find P(X > 1). (0.801): 1 - ppois(1, lambda = 3)
- 4. Let Y ∼ Gamma(2, 1/3). Find P(0.5 < Y < 1.5). (0.078): pgamma(1.5, shape = 2, rate = 1/3) - pgamma(0.5, shape = 2, rate = 1/3)
- 5. Let Z ∼ N(0, 1). Find z such that P(Z < z) = 0.975. (1.96): qnorm(0.975, mean = 0, sd = 1)
- 6. Let Z ∼ N(0, 1). Find P(−1.96 < Z < 1.96). (0.95): pnorm(1.96, mean = 0, sd = 1) - pnorm(-1.96, mean = 0, sd = 1)
- 7. Let Z ∼ N(0, 1). Find z such that P(−z < Z < z) = 0.90. (1.64): qnorm(0.95, mean = 0, sd = 1)
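Reference: Introduction to dnorm, pnorm, qnorm, and rnorm for new biostatisticians
- The process of updating from prior to posterior
- The difference between frequentist and Bayesian statistical inference
- Applying Bayesian inference to discrete and continuous data
- Why is the denominator of Bayes' theorem called the normalizing constant? Because it originates as f(y) or p(y), which does not depend on θ and represents the probability of the observed data; it is expanded with the law of total probability only later, for computational convenience.
- f(θ|y) ∝ f(y|θ)f(θ). The symbol ∝ stands for “is proportional to.”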
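Written out in full, with y for the observed data as in the note on the normalizing constant, the continuous version of Bayes' theorem reads as follows. The denominator is the law-of-total-probability expansion of f(y) and carries no dependence on θ, which is why it can be dropped to leave the proportional form.

```latex
f(\theta \mid y)
  = \frac{f(y \mid \theta)\, f(\theta)}{f(y)}
  = \frac{f(y \mid \theta)\, f(\theta)}{\int f(y \mid \theta)\, f(\theta)\, d\theta}
  \propto f(y \mid \theta)\, f(\theta)
```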
In this module, you will learn methods for selecting prior distributions and building models for discrete data.
- Lesson 6 introduces prior selection and predictive distributions as a means of evaluating priors.
- Lesson 7 demonstrates Bayesian analysis of Bernoulli data and introduces the computationally convenient concept of conjugate priors.
- Lesson 8 builds a conjugate model for Poisson data and discusses strategies for selection of prior hyperparameters.
- Understand the prior as representing information.
- Understand the concept of conjugate priors.
- Recognize the posterior mean as a weighted average of the prior mean and the data estimates, and understand the concept of an effective sample size of a prior.
- Compute posterior probabilities for Bernoulli, binomial, and Poisson likelihoods.
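Learn to work through the derivation (Lesson 7.1): when we use a uniform prior for a Bernoulli likelihood, we get a beta posterior. In fact, the uniform distribution is a Beta(1, 1).
What is a conjugate distribution? If, under the likelihood of a given distribution, the prior and the posterior belong to the same family, the prior is said to be conjugate to that likelihood. Conjugacy has very nice mathematical properties and fits the need for prior and posterior to alternate, with each posterior becoming the next prior. And any beta distribution is conjugate for the Bernoulli distribution: any beta prior will give a beta posterior.
Posterior mean and effective sample size: two computational questions about the beta distribution, and how the beta posterior is updated. A very useful weighting emerges: posterior mean = prior weight * prior mean + data weight * data mean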
This effective sample size also gives you an idea of how much data you would need to make sure that your prior doesn't have much influence on your posterior.
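The weighted-average reading of the posterior mean can be verified numerically. The sketch below assumes a Beta(8, 4) prior and 33 successes in 40 trials purely as illustrative numbers; the weights are the prior's effective sample size a + b and the data's sample size n, each divided by their total.

```r
# Assumed Beta(a, b) prior and binomial data; the numbers are illustrative.
a <- 8; b <- 4      # prior parameters; the effective sample size is a + b
n <- 40; y <- 33    # number of trials and number of successes

prior_mean   <- a / (a + b)
data_mean    <- y / n
prior_weight <- (a + b) / (a + b + n)
data_weight  <- n / (a + b + n)

# Posterior mean as a weighted average of prior mean and data mean.
weighted <- prior_weight * prior_mean + data_weight * data_mean

# Mean of the Beta(a + y, b + n - y) posterior, computed directly.
direct <- (a + y) / (a + b + n)

print(c(weighted = weighted, direct = direct))
```

With n = 40 against an effective prior sample size of 12, the data carry most of the weight, so the posterior mean sits much closer to 33/40 than to 8/12.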
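One of the best applications of Bayesian methods: with medical devices, you often have very small sample sizes, but you're only making minor updates to the devices and running new trials. The ability of Bayesian statistics to do easy sequential updates made it very practical and appealing for the medical device testing industry.
It is worth reinforcing the earlier point about the p functions: a p function evaluates the CDF F at a given value, that is, the probability of being less than or equal to that value, with a minimum of 0 and a maximum of 1. With that in mind, pbeta here is easy to understand.
How do we draw the prior, the likelihood, and the posterior in one plot? The posterior mean is somewhere in between the maximum likelihood estimate and the prior mean of two-thirds. Strictly speaking, putting them in one plot is not rigorous, because the likelihood has to be rescaled; the key point is to see how the concentration of the distribution changes.
When making chocolate chip cookies, the number of chips per cookie approximately follows a Poisson distribution, for which the gamma distribution is the conjugate prior. The mean and variance of the gamma distribution are must-knows.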
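A minimal sketch of the gamma-Poisson conjugate update for the cookie example follows. The prior parameters and the chip counts are hypothetical; the gamma is written in the shape/rate parametrization, where the mean is shape / rate and the variance is shape / rate^2.

```r
# Assumed Gamma(a, b) prior on the Poisson rate lambda (shape a, rate b).
a <- 8; b <- 2
prior_mean <- a / b      # gamma mean: shape / rate
prior_var  <- a / b^2    # gamma variance: shape / rate^2

# Hypothetical chip counts for five cookies.
chips <- c(3, 5, 4, 6, 2)

# Conjugate update: the shape gains the total count, the rate gains the sample size.
post_a <- a + sum(chips)
post_b <- b + length(chips)

post_mean <- post_a / post_b
ci <- qgamma(c(0.025, 0.975), shape = post_a, rate = post_b)  # equal-tailed 95% interval

print(post_mean)
print(ci)
```

Here the prior acts like b = 2 previously seen cookies with a = 8 chips in total, which is the Poisson analogue of the effective sample size of a beta prior.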
```r
# Suppose we are giving two students a multiple-choice exam with 40 questions,
# where each question has four choices. We don't know how much the students
# have studied for this exam, but we think that they will do better than just
# guessing randomly.
# 1) What are the parameters of interest?
# 2) What is our likelihood?
# 3) What prior should we use?
# 4) What is the prior probability P(theta>.25)? P(theta>.5)? P(theta>.8)?
# 5) Suppose the first student gets 33 questions right. What is the posterior
#    distribution for theta1? P(theta1>.25)? P(theta1>.5)? P(theta1>.8)?
#    What is a 95% posterior credible interval for theta1?
# 6) Suppose the second student gets 24 questions right. What is the posterior
#    distribution for theta2? P(theta2>.25)? P(theta2>.5)? P(theta2>.8)?
#    What is a 95% posterior credible interval for theta2?
# 7) What is the posterior probability that theta1>theta2, i.e., that the
#    first student has a better chance of getting a question right than
#    the second student?

############
# Solutions:

# 1) Parameters of interest are theta1=true probability the first student
#    will answer a question correctly, and theta2=true probability the second
#    student will answer a question correctly.

# 2) Likelihood is Binomial(40, theta), if we assume that each question is
#    independent and that the probability a student gets each question right
#    is the same for all questions for that student.

# 3) The conjugate prior is a beta prior. Plot the density with dbeta.
theta <- seq(from = 0, to = 1, by = .01)
plot(theta, dbeta(theta, 1, 1), type = "l")
plot(theta, dbeta(theta, 4, 2), type = "l")
plot(theta, dbeta(theta, 8, 4), type = "l")

# 4) Find probabilities using the pbeta function.
1 - pbeta(.25, 8, 4)
1 - pbeta(.5, 8, 4)
1 - pbeta(.8, 8, 4)

# 5) Posterior is Beta(8+33,4+40-33) = Beta(41,11)
41 / (41 + 11)  # posterior mean
33 / 40         # MLE

# overlay the posterior on the Beta(8,4) prior plot drawn above
lines(theta, dbeta(theta, 41, 11))

# plot posterior first to get the right scale on the y-axis
plot(theta, dbeta(theta, 41, 11), type = "l")
lines(theta, dbeta(theta, 8, 4), lty = 2)
# plot likelihood
lines(theta, dbinom(33, size = 40, prob = theta), lty = 3)
# plot scaled likelihood
lines(theta, 44 * dbinom(33, size = 40, prob = theta), lty = 3)

# posterior probabilities
1 - pbeta(.25, 41, 11)
1 - pbeta(.5, 41, 11)
1 - pbeta(.8, 41, 11)

# equal-tailed 95% credible interval
qbeta(.025, 41, 11)
qbeta(.975, 41, 11)

# 6) Posterior is Beta(8+24,4+40-24) = Beta(32,20)
32 / (32 + 20)  # posterior mean
24 / 40         # MLE

plot(theta, dbeta(theta, 32, 20), type = "l")
lines(theta, dbeta(theta, 8, 4), lty = 2)
lines(theta, 44 * dbinom(24, size = 40, prob = theta), lty = 3)

1 - pbeta(.25, 32, 20)
1 - pbeta(.5, 32, 20)
1 - pbeta(.8, 32, 20)

qbeta(.025, 32, 20)
qbeta(.975, 32, 20)

# 7) Estimate by simulation: draw 1,000 samples from each and see how often
#    we observe theta1>theta2
theta1 <- rbeta(1000, 41, 11)
theta2 <- rbeta(1000, 32, 20)
mean(theta1 > theta2)

# Note for other distributions:
# dgamma,pgamma,qgamma,rgamma
# dnorm,pnorm,qnorm,rnorm
```
This module covers conjugate and objective Bayesian analysis for continuous data.
- Lesson 9 presents the conjugate model for exponentially distributed data.
- Lesson 10 discusses models for normally distributed data, which play a central role in statistics.
- In Lesson 11, we return to prior selection and discuss ‘objective’ or ‘non-informative’ priors.
- Lesson 12 presents Bayesian linear regression with non-informative priors, which yield results comparable to those of classical regression.
For example, suppose you're waiting for a bus that you think comes on average once every ten minutes, but you're not sure exactly how often it comes.
The gamma distribution is conjugate for an exponential likelihood. Gammas are in fact conjugate for a number of different likelihoods.
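For the bus example, a gamma prior on the arrival rate updates in closed form after each observed waiting time. The sketch below is illustrative only: the Gamma(100, 1000) prior (prior mean 0.1 buses per minute, that is, one every ten minutes) and the 12-minute wait are assumptions.

```r
# Assumed prior: Gamma(shape = 100, rate = 1000) on the arrival rate lambda,
# so the prior mean is 100 / 1000 = 0.1 buses per minute.
a <- 100; b <- 1000

# Hypothetical observation: we waited 12 minutes for one bus.
wait <- 12

# Conjugate update for one exponential observation:
# the shape gains 1 (one arrival), the rate gains the waiting time.
post_a <- a + 1
post_b <- b + wait

post_mean <- post_a / post_b
ci <- qgamma(c(0.025, 0.975), shape = post_a, rate = post_b)  # equal-tailed 95% interval

print(post_mean)
print(ci)
```

Because the assumed prior is strong, a single longer-than-expected wait barely moves the posterior mean below 0.1.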
One article that covers it all: Probability concepts explained: Bayesian inference for parameter estimation.
所谓高手,就是把自己活成了贝叶斯定理 ("A true master is someone who lives by Bayes' theorem") - its introduction and examples are very good, though it has some problems under closer scrutiny.
Prior: some knowledge or belief that we already have (commonly known as the prior). No need to overcomplicate it: the prior is the knowledge we already hold, usually a marginal probability. P(A) is my belief prior to knowing anything about B. A prior can be a guess and can contain some subjectivity. More formally, P(A) is not a fixed value but a distribution, the prior distribution.
Posterior: the probability of our original guess after we obtain new data, which amounts to an update of the original subjective prior. P(Θ|data) on the left hand side is known as the posterior distribution. This is the distribution representing our belief about the parameter values after we have calculated everything on the right hand side taking the observed data into account.
Key point: Therefore we can calculate the posterior distribution of our parameters using our prior beliefs updated with our likelihood.
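We use these two sets of emails to "train" the filter. The larger the two sets, the better the training. Paul Graham used 4,000 normal emails and 4,000 spam emails.
The "training" process is simple. First, parse all the emails and extract every word. Then compute how frequently each word appears in normal email and in spam. For example, suppose the word "sex" appears in 200 of the 4,000 spam emails, so its frequency is 5%, while only 2 of the 4,000 normal emails contain it, a frequency of 0.05%. (Note: if a word appears only in spam, Paul Graham assumes its frequency in normal email is 1%, and vice versa. This avoids a probability of 0. As the number of emails grows, the computed result adjusts automatically.)
Reference: 贝叶斯线性回归 (Bayesian Linear Regression)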
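Given the two word frequencies in the example above, Bayes' theorem turns them into the probability that an email containing the word is spam. The sketch assumes a 50% prior probability of spam, which is not stated in the text above; the variable names are illustrative.

```r
# Word frequencies from the example above.
p_word_given_spam <- 200 / 4000   # 5%
p_word_given_ham  <- 2 / 4000     # 0.05%

# Assumption: before reading the email, spam and normal mail are equally likely.
p_spam <- 0.5

# Bayes' theorem: P(spam | email contains the word).
p_spam_given_word <- (p_word_given_spam * p_spam) /
  (p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam))

print(p_spam_given_word)
```

Changing `p_spam` shows how strongly the conclusion depends on the prior when the likelihood ratio is less extreme than the 100-to-1 ratio used here.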
