Hierarchical Bayesian Learning

Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics.

This is the inference framework on which the well-established methodologies of statistical hypothesis testing and confidence intervals are based.

Besides frequentist inference, the main alternative approach to statistical inference is Bayesian inference; another is fiducial inference.

There are two major differences between the frequentist and Bayesian approaches to inference that are not covered by the above consideration of the interpretation of probability:

• In a frequentist approach to inference, unknown parameters are often, but not always, treated as having fixed but unknown values that are not capable of being treated as random variates in any sense, and hence there is no way that probabilities can be associated with them. Frequentist conclusions, such as operational decisions and parameter estimates (with or without confidence intervals), are based solely on the (one set of) evidence at hand.

In contrast, a Bayesian approach to inference does allow probabilities to be associated with unknown parameters, where these probabilities can sometimes have a frequency probability interpretation as well as a Bayesian one. The Bayesian approach allows these probabilities to have an interpretation as representing the scientist's belief that given values of the parameter are true.

Bayesian inference is explicitly based on the evidence and prior opinion, which allows it to be based on multiple sets of evidence.

···While "probabilities" are involved in both approaches to inference, the probabilities are associated with different types of things. The result of a Bayesian approach can be a probability distribution for what is known about the parameters given the results of the experiment or study. The result of a frequentist approach is either a "true or false" conclusion from a significance test or a conclusion in the form that a given sample-derived confidence interval covers the true value: either of these conclusions has a given probability of being correct, where this probability has either a frequency probability interpretation or a pre-experiment interpretation.

 

Example: A frequentist does not say that there is a 95% probability that the true value of a parameter lies within a particular confidence interval; rather, 95% of confidence intervals constructed by the same procedure would contain the true value.
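This reading can be checked by simulation. The following minimal sketch (assuming a normal population and purely illustrative values for the mean, standard deviation, sample size, and number of repetitions) draws many samples and counts how often the standard 95% t-interval for the mean covers the true value:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, sigma, n, trials = 5.0, 2.0, 30, 10_000

covered = 0
for _ in range(trials):
    sample = rng.normal(true_mu, sigma, size=n)
    # 95% t-interval for the mean: x_bar +/- t_{0.975, n-1} * s / sqrt(n)
    x_bar, s = sample.mean(), sample.std(ddof=1)
    half_width = stats.t.ppf(0.975, df=n - 1) * s / np.sqrt(n)
    covered += (x_bar - half_width <= true_mu <= x_bar + half_width)

print(covered / trials)  # long-run coverage, close to 0.95

Each individual interval either does or does not contain true_mu; the "95%" describes the long-run proportion of intervals, across repetitions, that do.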

 

Efron's comparative adjectives
                           Bayes                          Frequentist
Basis                      Belief (prior)                 Behavior (method)
Resulting Characteristic   Principled Philosophy          Opportunistic Methods
_                          One distribution               Many distributions (bootstrap?)
Ideal Application          Dynamic (repeated sampling)    Static (one sample)
Target Audience            Individual (subjective)        Community (objective)
Modeling Characteristic    Aggressive                     Defensive

Probability and likelihood

A probability refers to variable data for a fixed hypothesis while a likelihood refers to variable hypotheses for a fixed set of data.

Each fixed set of observational conditions is associated with a probability distribution and each set of observations can be interpreted as a sample from that distribution – the frequentist view of probability.

Alternatively, a set of observations may result from sampling any of a number of distributions (each resulting from a set of observational conditions). The probabilistic relationship between a fixed sample and a variable distribution (resulting from a variable hypothesis) is termed likelihood – a Bayesian view of probability.

The likelihood principle says that all of the information in a sample is contained in the likelihood function, which is accepted as a valid probability distribution by Bayesians (but not by frequentists).
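The distinction can be made concrete with a binomial model. In the minimal sketch below (the choice of a binomial model and the specific numbers are illustrative assumptions), fixing the hypothesis and varying the data gives probabilities that sum to one, while fixing the data and varying the hypothesis gives a likelihood function over the parameter, which in general does not integrate to one:

import numpy as np
from scipy import stats

n = 10                                      # number of trials
# Probability: fix the hypothesis p = 0.3, vary the data k = 0..10
probs = stats.binom.pmf(np.arange(n + 1), n, 0.3)
print(probs.sum())                          # 1.0 -- a probability distribution over the data

# Likelihood: fix the data k = 7 successes, vary the hypothesis p
p_grid = np.linspace(0, 1, 101)
lik = stats.binom.pmf(7, n, p_grid)
print(np.trapz(lik, p_grid))                # not 1 in general -- not a density in p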

 

Many statisticians accept the cautionary words of statistician George Box: "All models are wrong, but some are useful."

Bayes’ theorem

The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options.

Suppose, in a study of the effectiveness of cardiac treatments in which the patients in hospital j have survival probability \theta _{j}, that the survival probability is to be updated given the occurrence of y, the event in which a hypothetical controversial serum is created that, as some believe, increases survival in cardiac patients.

In order to make updated probability statements about  \theta _{j}, given the occurrence of event y, we must begin with a model providing a joint probability distribution for  \theta _{j} and y. This can be written as a product of the two distributions that are often referred to as the prior distribution  P(\theta ) and the sampling distribution P(y\mid \theta ) respectively:

P(\theta ,y)=P(\theta )P(y\mid \theta )

Using the basic property of conditional probability, the posterior distribution is:

P(\theta \mid y)={\frac  {P(\theta ,y)}{P(y)}}={\frac  {P(y\mid \theta )P(\theta )}{P(y)}}

This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to incorporate the updated belief,  P(\theta \mid y), in appropriate and solvable ways.
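As a minimal numerical sketch of this update, assume (purely for illustration) a uniform prior P(\theta ) over a grid of survival probabilities and a binomial sampling distribution P(y\mid \theta ) for the number of survivors among a group of patients; the posterior then follows directly from Bayes' theorem:

import numpy as np
from scipy import stats

theta = np.linspace(0.01, 0.99, 99)              # grid of candidate survival probabilities
prior = np.full_like(theta, 1.0 / len(theta))    # uniform prior P(theta), an assumption

survivors, patients = 15, 20                     # illustrative data y
sampling = stats.binom.pmf(survivors, patients, theta)   # sampling distribution P(y | theta)

joint = prior * sampling                         # P(theta, y) = P(theta) P(y | theta)
posterior = joint / joint.sum()                  # P(theta | y) = P(theta, y) / P(y)
print(theta[np.argmax(posterior)])               # posterior mode, near 0.75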

Exchangeability

Random variables x_{1},\ldots ,x_{n} are exchangeable if their joint distribution is unchanged under any permutation of the indices; the order in which the observations are listed carries no information.

Finite exchangeability

If x_{1},\ldots ,x_{n} are independent and identically distributed, then they are exchangeable, but the converse is not true. For example, suppose a box contains one blue ball and one red ball, and the balls are drawn without replacement. The probability of drawing the red ball first and the probability of drawing the blue ball first are both 1/2.

But the probability of selecting the red ball on the second draw, given that it was already selected on the first draw, is 0 (writing y_{i}=1 when the red ball appears on draw i); this is not equal to the unconditional probability that the red ball is selected on the second draw, which is 1/2: P(y_2=1\mid y_1=1)=0 \ne P(y_2=1)=\frac{1}{2}.

Thus, y_{1} and y_{2} are not independent, although they are exchangeable: the joint distribution of (y_{1},y_{2}) is unchanged when the two draws are swapped.
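This can be checked by enumerating the joint distribution of the two draws directly (a small sketch; y_i = 1 means the red ball appears on draw i, as above):

from fractions import Fraction

# Joint distribution of (y1, y2): one red and one blue ball, drawn without replacement.
joint = {(1, 0): Fraction(1, 2), (0, 1): Fraction(1, 2),
         (1, 1): Fraction(0), (0, 0): Fraction(0)}

# Exchangeable: swapping the coordinates leaves the joint distribution unchanged.
print(all(joint[(a, b)] == joint[(b, a)] for (a, b) in joint))        # True

# Not independent: P(y1=1, y2=1) = 0, but P(y1=1) * P(y2=1) = 1/4.
p_y1 = sum(p for (a, _), p in joint.items() if a == 1)
p_y2 = sum(p for (_, b), p in joint.items() if b == 1)
print(joint[(1, 1)], p_y1 * p_y2)                                     # 0 1/4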

Infinite exchangeability

Infinite exchangeability is the property that every finite subset of an infinite sequence x_{1},x_{2},\ldots of random variables is exchangeable.

Hierarchical models

Components

Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely:

1. Hyperparameter: a parameter of the prior distribution

2. Hyperprior: the distribution of a hyperparameter

Say a random variable Y follows a normal distribution with parameters θ as the mean and 1 as the variance, that is Y\mid \theta \sim N(\theta ,1). The parameter  \theta has a prior distribution given by a normal distribution with mean  \mu and variance 1, i.e.  \theta \mid \mu \sim N(\mu ,1). Furthermore,  \mu follows another distribution given, for example, by the standard normal distribution {\text{N}}(0,1). The parameter \mu is called the hyperparameter, while its distribution given by  {\text{N}}(0,1) is an example of a hyperprior distribution.

The notation of the distribution of Y changes as another parameter is added, i.e. Y\mid \theta ,\mu \sim N(\theta ,1). If there is another stage, say that \mu follows another normal distribution with mean \beta and variance \epsilon , i.e. \mu \sim N(\beta ,\epsilon ), then \beta and \epsilon can also be called hyperparameters, and their distributions are hyperprior distributions as well.
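Read as a generative recipe, the hierarchy is sampled from the top down. A minimal NumPy sketch, assuming the same normal forms as above (\mu \sim N(0,1), \theta \mid \mu \sim N(\mu ,1), Y\mid \theta \sim N(\theta ,1)):

import numpy as np

rng = np.random.default_rng(42)

mu = rng.normal(0.0, 1.0)        # hyperprior:      mu ~ N(0, 1)
theta = rng.normal(mu, 1.0)      # prior:           theta | mu ~ N(mu, 1)
y = rng.normal(theta, 1.0)       # sampling model:  Y | theta ~ N(theta, 1)

print(mu, theta, y)              # one draw from the joint distribution of (mu, theta, Y)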

Framework

Let y_{j} be an observation and  \theta _{j} a parameter governing the data generating process for  y_{j}.

Assume further that the parameters  \theta _{1},\theta _{2},\ldots ,\theta _{j} are generated exchangeably from a common population, with distribution governed by a hyperparameter \phi .
The Bayesian hierarchical model contains the following stages:

{\text{Stage I: }}y_{j}\mid \theta _{j},\phi \sim P(y_{j}\mid \theta _{j},\phi )

{\text{Stage II: }}\theta _{j}\mid \phi \sim P(\theta _{j}\mid \phi )
{\text{Stage III: }}\phi \sim P(\phi )

The likelihood, as seen in stage I, is P(y_{j}\mid \theta _{j},\phi ), with P(\theta _{j},\phi ) as its prior distribution. Note that the likelihood depends on \phi only through \theta _{j}.

The prior distribution from stage I can be broken down into:

P(\theta _{j},\phi )=P(\theta _{j}\mid \phi )P(\phi ) [from the definition of conditional probability]

With \phi as the hyperparameter, whose hyperprior distribution is P(\phi ).

Thus, the posterior distribution is proportional to:

P(\phi ,\theta _{j}\mid y)\propto P(y_{j}\mid \theta _{j},\phi )P(\theta _{j},\phi ) [using Bayes’ Theorem]
P(\phi ,\theta _{j}\mid y)\propto P(y_{j}\mid \theta _{j})P(\theta _{j}\mid \phi )P(\phi )

Example

To further illustrate this, consider the example: A teacher wants to estimate how well a male student did on the SAT. The teacher uses information on the student's high school grades and his current grade point average (GPA) to come up with an estimate. The student's current GPA, denoted by Y, has a likelihood given by some probability function with parameter \theta , i.e. Y\mid \theta \sim P(Y\mid \theta ). This parameter \theta is the SAT score of the student. The SAT score is viewed as a sample coming from a common population distribution indexed by another parameter \phi , which is the high school grade of the student. That is, \theta \mid \phi \sim P(\theta \mid \phi ). Moreover, the hyperparameter \phi follows its own distribution given by P(\phi ), a hyperprior. To solve for the SAT score given information on the GPA,

P(\theta ,\phi \mid Y)\propto P(Y\mid \theta ,\phi )P(\theta ,\phi )
P(\theta ,\phi \mid Y)\propto P(Y\mid \theta )P(\theta \mid \phi )P(\phi )

All information in the problem is used to solve for the posterior distribution. Compared with using only the prior distribution and the likelihood function, the use of hyperpriors brings in additional information and so supports more accurate beliefs about the behavior of a parameter.

2-stage hierarchical model

In general, the joint posterior distribution of interest in 2-stage hierarchical models is:

P(\theta ,\phi \mid Y)={P(Y\mid \theta ,\phi )P(\theta ,\phi ) \over P(Y)}={P(Y\mid \theta )P(\theta \mid \phi )P(\phi ) \over P(Y)}
P(\theta ,\phi \mid Y)\propto P(Y\mid \theta )P(\theta \mid \phi )P(\phi )
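A minimal sketch of this joint posterior on a grid, assuming the normal forms from the Components section (Y\mid \theta \sim N(\theta ,1), \theta \mid \phi \sim N(\phi ,1), \phi \sim N(0,1)) and a single illustrative observation Y = 1.5:

import numpy as np
from scipy import stats

y_obs = 1.5                                      # illustrative observation
theta = np.linspace(-4.0, 6.0, 201)
phi = np.linspace(-4.0, 6.0, 201)
T, F = np.meshgrid(theta, phi, indexing="ij")    # T[i, j] = theta[i], F[i, j] = phi[j]

# Unnormalised joint posterior on the grid: P(Y | theta) * P(theta | phi) * P(phi)
unnorm = (stats.norm.pdf(y_obs, loc=T, scale=1.0)
          * stats.norm.pdf(T, loc=F, scale=1.0)
          * stats.norm.pdf(F, loc=0.0, scale=1.0))

posterior = unnorm / unnorm.sum()                # dividing by the grid total plays the role of 1/P(Y)
i, j = np.unravel_index(posterior.argmax(), posterior.shape)
print(theta[i], phi[j])                          # joint posterior mode of (theta, phi)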

3-stage hierarchical model

For 3-stage hierarchical models, the posterior distribution is given by:

P(\theta ,\phi ,X\mid Y)={P(Y\mid \theta )P(\theta \mid \phi )P(\phi \mid X)P(X) \over P(Y)}
P(\theta ,\phi ,X\mid Y)\propto P(Y\mid \theta )P(\theta \mid \phi )P(\phi \mid X)P(X)
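With three or more stages, grid approximations become unwieldy, and in practice such posteriors are usually explored with sampling-based tools. The sketch below uses PyMC as one possible choice; the normal forms, the variance values, and the observed data are all illustrative assumptions, not part of the text above:

import numpy as np
import pymc as pm

y_obs = np.array([1.2, 0.7, 2.1])                        # illustrative observations

with pm.Model() as three_stage:
    X = pm.Normal("X", mu=0.0, sigma=1.0)                # top stage:  X ~ P(X)
    phi = pm.Normal("phi", mu=X, sigma=1.0)              # hyperprior: phi | X ~ P(phi | X)
    theta = pm.Normal("theta", mu=phi, sigma=1.0)        # prior:      theta | phi ~ P(theta | phi)
    pm.Normal("Y", mu=theta, sigma=1.0, observed=y_obs)  # likelihood: Y | theta ~ P(Y | theta)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)  # draws from P(theta, phi, X | Y)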