Hierarchical Bayesian Learning

Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics.

This is the inference framework on which the well-established methodologies of statistical hypothesis testing and confidence intervals are based.

Besides frequentist inference, the main alternative approach to statistical inference is Bayesian inference; another is fiducial inference.

There are two major differences between the frequentist and Bayesian approaches to inference that are not covered by the above consideration of the interpretation of probability:

• In a frequentist approach to inference, unknown parameters are often, but not always, treated as having fixed but unknown values that are not capable of being treated as random variates in any sense, and hence there is no way that probabilities can be associated with them. Frequentist conclusions, such as operational decisions and parameter estimates (with or without confidence intervals), are based solely on the (one set of) evidence at hand.

In contrast, a Bayesian approach to inference does allow probabilities to be associated with unknown parameters, where these probabilities can sometimes have a frequency probability interpretation as well as a Bayesian one. The Bayesian approach allows these probabilities to have an interpretation as representing the scientist's belief that given values of the parameter are true.

Bayesian inference is explicitly based on the evidence and prior opinion, which allows it to be based on multiple sets of evidence.

···While "probabilities" are involved in both approaches to inference, the probabilities are associated with different types of things. The result of a Bayesian approach can be a probability distribution for what is known about the parameters given the results of the experiment or study. The result of a frequentist approach is either a "true or false" conclusion from a significance test or a conclusion in the form that a given sample-derived confidence interval covers the true value: either of these conclusions has a given probability of being correct, where this probability has either a frequency probability interpretation or a pre-experiment interpretation.

 

Example: A frequentist does not say that there is a 95% probability that the true value of a parameter lies within a particular confidence interval; rather, 95% of confidence intervals constructed by the same procedure would contain the true value.
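This reading can be checked by simulation. The following minimal sketch (assuming a normal population and purely illustrative values for the mean, standard deviation, sample size, and number of repetitions) draws many samples and counts how often the standard 95% t-interval for the mean covers the true value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 5.0, 2.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, sigma, size=n)
    # 95% t-interval for the mean: x_bar +/- t_{0.975, n-1} * s / sqrt(n)
    x_bar, s = sample.mean(), sample.std(ddof=1)
    half_width = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
    covered += (x_bar - half_width <= true_mu <= x_bar + half_width)

print(covered / trials)  # long-run coverage, close to 0.95

Each individual interval either does or does not contain true_mu; the "95%" describes the long-run proportion of intervals, across repetitions, that do.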

 

Efron's comparative adjectives
                           Bayes                          Frequentist
Basis                      Belief (prior)                 Behavior (method)
Resulting Characteristic   Principled Philosophy          Opportunistic Methods
_                          One distribution               Many distributions (bootstrap?)
Ideal Application          Dynamic (repeated sampling)    Static (one sample)
Target Audience            Individual (subjective)        Community (objective)
Modeling Characteristic    Aggressive                     Defensive

Probability and likelihood

A probability refers to variable data for a fixed hypothesis while a likelihood refers to variable hypotheses for a fixed set of data.

Each fixed set of observational conditions is associated with a probability distribution and each set of observations can be interpreted as a sample from that distribution – the frequentist view of probability.

Alternatively, a set of observations may result from sampling any of a number of distributions (each resulting from a set of observational conditions). The probabilistic relationship between a fixed sample and a variable distribution (resulting from a variable hypothesis) is termed likelihood – a Bayesian view of probability.

The likelihood principle says that all of the information in a sample is contained in the likelihood function, which is accepted as a valid probability distribution by Bayesians (but not by frequentists).
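The distinction can be made concrete with a binomial model. In the minimal sketch below (the choice of a binomial model and the specific numbers are illustrative assumptions), fixing the hypothesis and varying the data gives probabilities that sum to one, while fixing the data and varying the hypothesis gives a likelihood function over the parameter, which in general does not integrate to one:

import numpy as np
from scipy import stats

n = 10                                      # number of trials
# Probability: fix the hypothesis p = 0.3, vary the data k = 0..10
probs = stats.binom.pmf(np.arange(n + 1), n, 0.3)
print(probs.sum())                          # 1.0 -- a probability distribution over the data

# Likelihood: fix the data k = 7 successes, vary the hypothesis p
p_grid = np.linspace(0, 1, 101)
lik = stats.binom.pmf(7, n, p_grid)
print(np.trapz(lik, p_grid))                # not 1 in general -- not a density in p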

 

Many statisticians accept the cautionary words of statistician George Box: "All models are wrong, but some are useful."

Bayes’ theorem

The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options.

Suppose, in a study of the effectiveness of cardiac treatments in which the patients in hospital j have survival probability \theta _{j}, that the survival probability is to be updated given the occurrence of y, the event in which a hypothetical controversial serum is created that, as some believe, increases survival in cardiac patients.

In order to make updated probability statements about  \theta _{j}, given the occurrence of event y, we must begin with a model providing a joint probability distribution for  \theta _{j} and y. This can be written as a product of the two distributions that are often referred to as the prior distribution  P(\theta ) and the sampling distribution P(y\mid \theta ) respectively:

P(\theta ,y)=P(\theta )P(y\mid \theta )

Using the basic property of conditional probability, the posterior distribution is:

P(\theta \mid y)={\frac  {P(\theta ,y)}{P(y)}}={\frac  {P(y\mid \theta )P(\theta )}{P(y)}}

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to incorporate the updated belief,  P(\theta \mid y), in appropriate and solvable ways.
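As a minimal numerical sketch of this update, assume (purely for illustration) a uniform prior P(\theta ) over a grid of survival probabilities and a binomial sampling distribution P(y\mid \theta ) for the number of survivors among a group of patients; the posterior then follows directly from Bayes' theorem:

import numpy as np
from scipy import stats

theta = np.linspace(0.01, 0.99, 99)              # grid of candidate survival probabilities
prior = np.full_like(theta, 1.0 / len(theta))    # uniform prior P(theta), an assumption

survivors, patients = 15, 20                     # illustrative data y
sampling = stats.binom.pmf(survivors, patients, theta)   # sampling distribution P(y | theta)

joint = prior * sampling                         # P(theta, y) = P(theta) P(y | theta)
posterior = joint / joint.sum()                  # P(theta | y) = P(theta, y) / P(y)
print(theta[np.argmax(posterior)])               # posterior mode, near 0.75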

Exchangeability

Random variables x_{1},\ldots ,x_{n} are exchangeable if their joint distribution is unchanged under any permutation of the indices; the order in which the observations are listed carries no information.

Finite exchangeability

If x_{1},\ldots ,x_{n} are independent and identically distributed, then they are exchangeable, but the converse is not true. For example, suppose a box contains one blue ball and one red ball, and the balls are drawn without replacement. The probability of drawing the red ball first and the probability of drawing the blue ball first are both 1/2.

But the probability of selecting the red ball on the second draw, given that it was already selected on the first draw, is 0 (writing y_{i}=1 when the red ball appears on draw i); this is not equal to the unconditional probability that the red ball is selected on the second draw, which is 1/2: P(y_2=1\mid y_1=1)=0 \ne P(y_2=1)=\frac{1}{2}.

Thus, y_{1} and y_{2} are not independent, although they are exchangeable: the joint distribution of (y_{1},y_{2}) is unchanged when the two draws are swapped.
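This can be checked by enumerating the joint distribution of the two draws directly (a small sketch; y_i = 1 means the red ball appears on draw i, as above):

from fractions import Fraction

# Joint distribution of (y1, y2): one red and one blue ball, drawn without replacement.
joint = {(1, 0): Fraction(1, 2), (0, 1): Fraction(1, 2),
         (1, 1): Fraction(0), (0, 0): Fraction(0)}

# Exchangeable: swapping the coordinates leaves the joint distribution unchanged.
print(all(joint[(a, b)] == joint[(b, a)] for (a, b) in joint))        # True

# Not independent: P(y1=1, y2=1) = 0, but P(y1=1) * P(y2=1) = 1/4.
p_y1 = sum(p for (a, _), p in joint.items() if a == 1)
p_y2 = sum(p for (_, b), p in joint.items() if b == 1)
print(joint[(1, 1)], p_y1 * p_y2)                                     # 0 1/4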

Infinite exchangeability

Infinite exchangeability is the property that every finite subset of an infinite sequence x_{1},x_{2},\ldots of random variables is exchangeable.

Hierarchical models

Components

Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely:

1. Hyperparameter: a parameter of the prior distribution

2. Hyperprior: the distribution of a hyperparameter

Say a random variable Y follows a normal distribution with parameters θ as the mean and 1 as the variance, that is Y\mid \theta \sim N(\theta ,1). The parameter  \theta has a prior distribution given by a normal distribution with mean  \mu and variance 1, i.e.  \theta \mid \mu \sim N(\mu ,1). Furthermore,  \mu follows another distribution given, for example, by the standard normal distribution {\text{N}}(0,1). The parameter \mu is called the hyperparameter, while its distribution given by  {\text{N}}(0,1) is an example of a hyperprior distribution.

The notation of the distribution of Y changes as another parameter is added, i.e. Y\mid \theta ,\mu \sim N(\theta ,1). If there is another stage, say that \mu follows another normal distribution with mean \beta and variance \epsilon , i.e. \mu \sim N(\beta ,\epsilon ), then \beta and \epsilon can also be called hyperparameters, and their distributions are hyperprior distributions as well.
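Read as a generative recipe, the hierarchy is sampled from the top down. A minimal NumPy sketch, assuming the same normal forms as above (\mu \sim N(0,1), \theta \mid \mu \sim N(\mu ,1), Y\mid \theta \sim N(\theta ,1)):

import numpy as np

rng = np.random.default_rng(42)

mu = rng.normal(0.0, 1.0)        # hyperprior:      mu ~ N(0, 1)
theta = rng.normal(mu, 1.0)      # prior:           theta | mu ~ N(mu, 1)
y = rng.normal(theta, 1.0)       # sampling model:  Y | theta ~ N(theta, 1)

print(mu, theta, y)              # one draw from the joint distribution of (mu, theta, Y)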

Framework

Let y_{j} be an observation and  \theta _{j} a parameter governing the data generating process for  y_{j}.

Assume further that the parameters  \theta _{1},\theta _{2},\ldots ,\theta _{j} are generated exchangeably from a common population, with distribution governed by a hyperparameter \phi .
The Bayesian hierarchical model contains the following stages:

{\text{Stage I: }}y_{j}\mid \theta _{j},\phi \sim P(y_{j}\mid \theta _{j},\phi )

{\text{Stage II: }}\theta _{j}\mid \phi \sim P(\theta _{j}\mid \phi )
{\text{Stage III: }}\phi \sim P(\phi )

The likelihood, as seen in stage I, is P(y_{j}\mid \theta _{j},\phi ), with P(\theta _{j},\phi ) as its prior distribution. Note that the likelihood depends on \phi only through \theta _{j}.

The prior distribution from stage I can be broken down into:

P(\theta _{j},\phi )=P(\theta _{j}\mid \phi )P(\phi ) [from the definition of conditional probability]

With \phi as the hyperparameter, whose hyperprior distribution is P(\phi ).

Thus, the posterior distribution is proportional to:

P(\phi ,\theta _{j}\mid y)\propto P(y_{j}\mid \theta _{j},\phi )P(\theta _{j},\phi ) [using Bayes’ Theorem]
P(\phi ,\theta _{j}\mid y)\propto P(y_{j}\mid \theta _{j})P(\theta _{j}\mid \phi )P(\phi )

Example

To further illustrate this, consider the example: A teacher wants to estimate how well a male student did on the SAT. The teacher uses information on the student's high school grades and his current grade point average (GPA) to come up with an estimate. The student's current GPA, denoted by Y, has a likelihood given by some probability function with parameter \theta , i.e. Y\mid \theta \sim P(Y\mid \theta ). This parameter \theta is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter \phi , which is the high school grade of the student. That is, \theta \mid \phi \sim P(\theta \mid \phi ). Moreover, the hyperparameter \phi follows its own distribution given by P(\phi ), a hyperprior. To solve for the SAT score given information on the GPA,

P(\theta ,\phi \mid Y)\propto P(Y\mid \theta ,\phi )P(\theta ,\phi )
P(\theta ,\phi \mid Y)\propto P(Y\mid \theta )P(\theta \mid \phi )P(\phi )

All information in the problem is used to solve for the posterior distribution. Compared with using only the prior distribution and the likelihood function, the use of hyperpriors brings in additional information and so supports more accurate beliefs about the behavior of a parameter.

2-stage hierarchical model

In general, the joint posterior distribution of interest in 2-stage hierarchical models is:

P(\theta ,\phi \mid Y)={P(Y\mid \theta ,\phi )P(\theta ,\phi ) \over P(Y)}={P(Y\mid \theta )P(\theta \mid \phi )P(\phi ) \over P(Y)}
P(\theta ,\phi \mid Y)\propto P(Y\mid \theta )P(\theta \mid \phi )P(\phi )
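A minimal sketch of this joint posterior on a grid, assuming the normal forms from the Components section (Y\mid \theta \sim N(\theta ,1), \theta \mid \phi \sim N(\phi ,1), \phi \sim N(0,1)) and a single illustrative observation Y = 1.5:

import numpy as np
from scipy import stats

y_obs = 1.5                                      # illustrative observation
theta = np.linspace(-4.0, 6.0, 201)
phi = np.linspace(-4.0, 6.0, 201)
T, F = np.meshgrid(theta, phi, indexing="ij")    # T[i, j] = theta[i], F[i, j] = phi[j]

# Unnormalised joint posterior on the grid: P(Y | theta) * P(theta | phi) * P(phi)
unnorm = (stats.norm.pdf(y_obs, loc=T, scale=1.0)
          * stats.norm.pdf(T, loc=F, scale=1.0)
          * stats.norm.pdf(F, loc=0.0, scale=1.0))

posterior = unnorm / unnorm.sum()                # dividing by the grid total plays the role of 1/P(Y)
i, j = np.unravel_index(posterior.argmax(), posterior.shape)
print(theta[i], phi[j])                          # joint posterior mode of (theta, phi)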

3-stage hierarchical model

For 3-stage hierarchical models, the posterior distribution is given by:

P(\theta ,\phi ,X\mid Y)={P(Y\mid \theta )P(\theta \mid \phi )P(\phi \mid X)P(X) \over P(Y)}
P(\theta ,\phi ,X\mid Y)\propto P(Y\mid \theta )P(\theta \mid \phi )P(\phi \mid X)P(X)
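With three or more stages, grid approximations become unwieldy, and in practice such posteriors are usually explored with sampling-based tools. The sketch below uses PyMC as one possible choice; the normal forms, the variance values, and the observed data are all illustrative assumptions, not part of the text above:

import numpy as np
import pymc as pm

y_obs = np.array([1.2, 0.7, 2.1])                        # illustrative observations

with pm.Model() as three_stage:
    X = pm.Normal("X", mu=0.0, sigma=1.0)                # top stage:  X ~ P(X)
    phi = pm.Normal("phi", mu=X, sigma=1.0)              # hyperprior: phi | X ~ P(phi | X)
    theta = pm.Normal("theta", mu=phi, sigma=1.0)        # prior:      theta | phi ~ P(theta | phi)
    pm.Normal("Y", mu=theta, sigma=1.0, observed=y_obs)  # likelihood: Y | theta ~ P(Y | theta)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)  # draws from P(theta, phi, X | Y)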