频率主义与贝叶斯主义 IV：Python的贝叶斯工具

这个notebook出自Pythonic Perambulations的博文。The content is BSD licensed.

这个系列共4个部分：中文版Part I Part II Part III Part IV，英文版Part I Part II Part III Part IV

我之前花了一堆时间来分享这两种思想。

频率主义与贝叶斯主义 I: 实用介绍，论述了频率主义和贝叶斯主义在科学计算方面的差异，在简单问题的处理方面两者基本等价，会得到相同的点估计(point estimates)结果。
频率主义 vs 贝叶斯主义 II：当结果不同时，进一步介绍两者的差异，尤其是处理冗余参数方面的差异。
频率主义 vs 贝叶斯主义 III：置信与可信，频率主义与科学，不能混为一谈，介绍了频率主义置信区间与贝叶斯主义可信范围的差异，指出在科学研究中频率主义只会回答错误的问题。

这一篇文章，我打算撇开哪些争论，谈谈Python实现贝叶斯研究的工具。现代贝叶斯主义的核心是马尔可夫链蒙特卡罗算法，MCMC，样本后验分布(sample posterior distributions)的高效算法。

下面我就用Python的三个包来演示MCMC:

emcee: the MCMC Hammer
pymc: Bayesian Statistical Modeling in Python
pystan: The Python Interface to Stan

我不会太关心它们的性能，也不会比较各自的API。这篇文章也不是教程；这三个包都有很完整的文档和教程。我要做的是比较一下三者的用法。用一个相对简单的例子，演示三个包的解法，希望对你有所帮助。

测试案例：线性拟合最优解

为了解决问题，我们做一个三参数模型来找数据的线性拟合解。参数是斜率，截距和相关性(scatter about the line)；这个相关性可以当作冗余参数

数据

先来定义一些数据

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(42)
theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = theta_true[0] + theta_true[1] * xdata

# add scatter to points
xdata = np.random.normal(xdata, 10)
ydata = np.random.normal(ydata, 10)

In [2]:

plt.plot(xdata, ydata, 'ok')
plt.xlabel('x')
plt.ylabel('y');

数据明显具有相关性，我们假设我们不知道误差。让我们构建一条直线来拟合。

模型：斜率，截距和未知相关系数

由贝叶斯定义可得

P (θ | D) \propto P (D | θ) P (θ)

其中，

假设线性模型有斜率

y^(x i | α, β) = α + β x i

假设y轴坐标具有正态分布误差，那么模型下的任何数据点的概率为：

P (x i, y i | α, β, σ) = 1 2 π σ 2----\sqrt exp [ - [ y i - y

其中，

所以

P ({x i}, {y i} | α, β, σ) \propto (2 π σ 2) - N / 2 exp [-

先验条件

这里我们比之前的更仔细地选择先验条件。我们可以简单的认为

为什么flat prior可能产生坏的选择？看看斜率就明白了。让我们来演示一下斜率从0-10的直线变化情况；

In [3]:

fig, ax = plt.subplots(subplot_kw=dict(aspect='equal'))
x = np.linspace(-1, 1)

for slope in np.arange(0, 10, 0.1):
    plt.plot(x, slope * x, '-k')

ax.axis([-1, 1, -1, 1], aspect='equal');

这些线按0.1的斜率间隔进行分布，高斜率区域的线十分密集。因为是扁平先验，你肯定会认为这些斜率彼此无差异。由上面的密集区域显然可以看出，斜率的扁平先验满足那些很陡的斜率。斜率的扁平先验不是一个最小化的无信息先验(minimally informative prior)，可能是最终使你的结果有偏差(尽管有足够多的数据，结果可能几乎是0)。

你可能想动手做出一个更好的方案(可能在直线与x轴的夹角

斜率和截距的先验

假设模型为

y = α + β x

那么构建参数空间(parameter-space)，其概率元素为

因为

x = α' + β' y

其概率元素为

(α', β') = (- β - 1 α, β - 1) .

通过雅可比变换(the Jacobian of the transformation)，可得

Q (α', β') = β 3 P (α, β) .

为了保持对称性，需要保证变量的改变不会影响先验概率，因此有：

β 3 P (α, β) = P (- β - 1 α, β - 1) .

此函数满足：

P (α, β) \propto (1 + β 2) - 3 / 2 .

即

你可能会奇怪，斜率的分布服从参数为

同理，我们希望

P (σ) d σ = P (σ / c) d σ / c .

方程等价于

P (σ) \propto 1 / σ .

这就是Harold Jeffreys提出的Jeffreys先验。

把先验放一起

把它们放到一起，我们用这些对称性参数推导出模型的最小无信息先验：

P (α, β, σ) \propto 1 σ ( 1 + β 2 ) - 3 / 2

综上所述，你可能用扁平先验做参数变换

用Python解决问题

现在有了数据，似然估计和先验，让我们用Python的emcee，PyMC和PyStan来演示。首先，让我们做一些辅助工作来可视化数据：

In [4]:

# Create some convenience routines for plotting

def compute_sigma_level(trace1, trace2, nbins=20):
    """From a set of traces, bin by number of standard deviations"""
    L, xbins, ybins = np.histogram2d(trace1, trace2, nbins)
    L[L == 0] = 1E-16
    logL = np.log(L)

    shape = L.shape
    L = L.ravel()

    # obtain the indices to sort and unsort the flattened array
    i_sort = np.argsort(L)[::-1]
    i_unsort = np.argsort(i_sort)

    L_cumsum = L[i_sort].cumsum()
    L_cumsum /= L_cumsum[-1]
    
    xbins = 0.5 * (xbins[1:] + xbins[:-1])
    ybins = 0.5 * (ybins[1:] + ybins[:-1])

    return xbins, ybins, L_cumsum[i_unsort].reshape(shape)


def plot_MCMC_trace(ax, xdata, ydata, trace, scatter=False, **kwargs):
    """Plot traces and contours"""
    xbins, ybins, sigma = compute_sigma_level(trace[0], trace[1])
    ax.contour(xbins, ybins, sigma.T, levels=[0.683, 0.955], **kwargs)
    if scatter:
        ax.plot(trace[0], trace[1], ',k', alpha=0.1)
    ax.set_xlabel(r'$\alpha$')
    ax.set_ylabel(r'$\beta$')
    
    
def plot_MCMC_model(ax, xdata, ydata, trace):
    """Plot the linear model and 2sigma contours"""
    ax.plot(xdata, ydata, 'ok')

    alpha, beta = trace[:2]
    xfit = np.linspace(-20, 120, 10)
    yfit = alpha[:, None] + beta[:, None] * xfit
    mu = yfit.mean(0)
    sig = 2 * yfit.std(0)

    ax.plot(xfit, mu, '-k')
    ax.fill_between(xfit, mu - sig, mu + sig, color='lightgray')

    ax.set_xlabel('x')
    ax.set_ylabel('y')


def plot_MCMC_results(xdata, ydata, trace, colors='k'):
    """Plot both the trace and the model together"""
    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    plot_MCMC_trace(ax[0], xdata, ydata, trace, True, colors=colors)
    plot_MCMC_model(ax[1], xdata, ydata, trace)

下面，就让我们来解决MCMC。

Emcee

emcee(就是Python历史上著名的MCMC Hammer)是天文学家Dan Foreman-Mackey写的纯Python包。这个简洁的包实现了复杂的仿射无关哈密顿(Affine-invariant Hamiltonian)MCMC。因为是纯Python，所以包很容易安装：

[~]$ pip install emcee

Emcee没有很多引用代码；只要传入Python函数就会返回与对数后验概率匹配的值和后验样本。下面用emcee演示一下：

In [5]:

import emcee
print(emcee.__version__)

2.1.0

In [6]:

# Define our posterior using Python functions
# for clarity, I've separated-out the prior and likelihood
# but this is not necessary. Note that emcee requires log-posterior

# 用Python函数定义后验，我把先验和似然估计分开写了，其实没必要，主要是显得更简洁。
# 注意emcee需要对数后验证

def log_prior(theta):
    alpha, beta, sigma = theta
    if sigma < 0:
        return -np.inf  # log(0)
    else:
        return -1.5 * np.log(1 + beta ** 2) - np.log(sigma)

def log_likelihood(theta, x, y):
    alpha, beta, sigma = theta
    y_model = alpha + beta * x
    return -0.5 * np.sum(np.log(2 * np.pi * sigma ** 2) + (y - y_model) ** 2 / sigma ** 2)

def log_posterior(theta, x, y):
    return log_prior(theta) + log_likelihood(theta, x, y)

In [7]:

# Here we'll set up the computation. emcee combines multiple "walkers",
# each of which is its own MCMC chain. The number of trace results will
# be nwalkers * nsteps

# 设置计算参数。emcee组合了多个"walkers"，每个都有自己的MCMC链。
# 跟踪结果的数量为nwalkers * nsteps

ndim = 3  # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000  # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# set theta near the maximum likelihood, with 
np.random.seed(0)
starting_guesses = np.random.random((nwalkers, ndim))

In [8]:

# Here's the function call where all the work happens:
# we'll time it using IPython's %time magic

# 这就是所有调用内容。
# 这里用IPythn的%time方法计时。

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[xdata, ydata])
%time sampler.run_mcmc(starting_guesses, nsteps)
print("done")

CPU times: user 4.38 s, sys: 1.09 ms, total: 4.38 s
Wall time: 4.38 s
done

In [9]:

# sampler.chain is of shape (nwalkers, nsteps, ndim)
# we'll throw-out the burn-in points and reshape:

# sampler.chain返回数组维度为(nwalkers, nsteps, ndim)

sampler.chain
emcee_trace = sampler.chain[:, nburn:, :].reshape(-1, ndim).T
plot_MCMC_results(xdata, ydata, emcee_trace)

左图，我们演示了由冗余参数

PyMC

PyMC包比emcee有更多的特征，有效的支持许多主流先验分布。 PyMC默认使用经典的Metropolis-Hastings取样器，最早的MCMC算法。性能方面，使用了Fortran库，因此用pip安装就有点复杂。先安装Fortran编译器，pip install pymc可以了。如果没装Fortran编译器，可以通过conda安装。

据说PyMC第三版要移除Fortran依赖，安装就容易了。而且PyMC 3的API更简洁，性能更好。但是目前仍是alpha版本，所以现在用就版本：

In [10]:

import pymc
print(pymc.__version__)

2.3.3

In [11]:

# Define the variables needed for the routine, with their prior distributions
alpha = pymc.Uniform('alpha', -100, 100)

@pymc.stochastic(observed=False)
def beta(value=0):
    return -1.5 * np.log(1 + value ** 2)

@pymc.stochastic(observed=False)
def sigma(value=1):
    return -np.log(abs(value))

# Define the form of the model and likelihood
@pymc.deterministic
def y_model(x=xdata, alpha=alpha, beta=beta):
    return alpha + beta * x

y = pymc.Normal('y', mu=y_model, tau=1. / sigma ** 2, observed=True, value=ydata)

# package the full model in a dictionary
model1 = dict(alpha=alpha, beta=beta, sigma=sigma,
              y_model=y_model, y=y)

In [12]:

# run the basic MCMC: we'll do 100000 iterations to match emcee above
S = pymc.MCMC(model1)
S.sample(iter=100000, burn=50000)

 [-----------------100%-----------------] 100000 of 100000 complete in 55.8 sec

In [13]:

# extract the traces and plot the results
pymc_trace = [S.trace('alpha')[:],
              S.trace('beta')[:],
              S.trace('sigma')[:]]

plot_MCMC_results(xdata, ydata, pymc_trace)

结果和emcee很相似。

PyStan

PyStan项目是C++实现的Stan概率编程语言的Python封装版。使用No U-Turn取样器，比Metropolis-Hastings和Gibbs取样更复杂。不考虑API，PyStan与emcee和PyMC的差异是，它要求在Python程序内编写并编译非Python编码。

因为PyStan依赖Stan包，安装很困难。为了用Python封装版需要安装全部的Stan库。如果你有C/C++编译环境，直接pip install pystan就可以搞定。

好像Stan也提供已经编译版本，支持conda。因此，如果你没有C/C++编译环境，安装可能比较麻烦。

我用的PyStan 2.5版本：

In [14]:

import pystan
print(pystan.__version__)

2.5.0.0

还有一点诡异的是：PyStan在IPython notebook上运行容易崩溃，直接执行Python文件就没事。不明觉厉，但是这里可能是解法。所以，安全起见，我直接在命令行运行Python文件，结果写在盘里：

In [15]:

%%file pystan_example.py

import numpy as np
import pystan

#---------------------------------------------
# Generate data (same as used in the notebook)

np.random.seed(42)
theta_true = (25, 0.5)
xdata = 100 * np.random.random(20)
ydata = theta_true[0] + theta_true[1] * xdata

# add scatter to points
xdata = np.random.normal(xdata, 10)
ydata = np.random.normal(ydata, 10)

#----------------------------------------------
# Create the Stan model
#  this is done by defining a string of Stan code.

fit_code = """
data {
    int<lower=0> N; // number of points
    real x[N]; // x values
    real y[N]; // y values
}

parameters {
    real alpha_perp;
    real<lower=-pi()/2, upper=pi()/2> theta;
    real log_sigma;
}

transformed parameters {
    real alpha;
    real beta;
    real sigma;
    real ymodel[N];
    
    alpha <- alpha_perp / cos(theta);
    beta <- sin(theta);
    sigma <- exp(log_sigma);
    for (j in 1:N)
    ymodel[j] <- alpha + beta * x[j];
}

model {
    y ~ normal(ymodel, sigma);
}
"""

# perform the fit
fit_data = {'N': len(xdata), 'x': xdata, 'y': ydata}
fit = pystan.stan(model_code=fit_code, data=fit_data, iter=25000, chains=4)

# extract the traces
traces = fit.extract()
pystan_trace = [traces['alpha'], traces['beta'], traces['sigma']]

# save the traces with numpy
np.save("pystan_trace.npy", pystan_trace)

Writing pystan_example.py

In [16]:

# run the code we've created on the command-line
!python pystan_example.py

INFO:pystan:COMPILING THE C++ CODE FOR MODEL anon_model_c1dba2ed7f485b674d7ce5eb738ffe05 NOW.
Iteration:     1 / 25000 [  0%]  (Warmup) (Chain 3)
Iteration:     1 / 25000 [  0%]  (Warmup) (Chain 2)
Iteration:     1 / 25000 [  0%]  (Warmup) (Chain 1)
Iteration:     1 / 25000 [  0%]  (Warmup) (Chain 0)
Iteration:  2500 / 25000 [ 10%]  (Warmup) (Chain 1)
Iteration:  2500 / 25000 [ 10%]  (Warmup) (Chain 2)
Iteration:  2500 / 25000 [ 10%]  (Warmup) (Chain 3)
Iteration:  2500 / 25000 [ 10%]  (Warmup) (Chain 0)
Iteration:  5000 / 25000 [ 20%]  (Warmup) (Chain 1)
Iteration:  5000 / 25000 [ 20%]  (Warmup) (Chain 2)
Iteration:  5000 / 25000 [ 20%]  (Warmup) (Chain 3)
Iteration:  7500 / 25000 [ 30%]  (Warmup) (Chain 1)
Iteration:  7500 / 25000 [ 30%]  (Warmup) (Chain 2)
Iteration:  5000 / 25000 [ 20%]  (Warmup) (Chain 0)
Iteration:  7500 / 25000 [ 30%]  (Warmup) (Chain 3)
Iteration: 10000 / 25000 [ 40%]  (Warmup) (Chain 1)
Iteration: 10000 / 25000 [ 40%]  (Warmup) (Chain 2)
Iteration: 10000 / 25000 [ 40%]  (Warmup) (Chain 3)
Iteration: 12500 / 25000 [ 50%]  (Warmup) (Chain 1)
Iteration: 12501 / 25000 [ 50%]  (Sampling) (Chain 1)
Iteration:  7500 / 25000 [ 30%]  (Warmup) (Chain 0)
Iteration: 12500 / 25000 [ 50%]  (Warmup) (Chain 2)
Iteration: 12501 / 25000 [ 50%]  (Sampling) (Chain 2)
Iteration: 10000 / 25000 [ 40%]  (Warmup) (Chain 0)
Iteration: 15000 / 25000 [ 60%]  (Sampling) (Chain 1)
Iteration: 12500 / 25000 [ 50%]  (Warmup) (Chain 3)
Iteration: 12501 / 25000 [ 50%]  (Sampling) (Chain 3)
Iteration: 15000 / 25000 [ 60%]  (Sampling) (Chain 2)
Iteration: 17500 / 25000 [ 70%]  (Sampling) (Chain 1)
Iteration: 15000 / 25000 [ 60%]  (Sampling) (Chain 3)
Iteration: 17500 / 25000 [ 70%]  (Sampling) (Chain 2)
Iteration: 12500 / 25000 [ 50%]  (Warmup) (Chain 0)
Iteration: 12501 / 25000 [ 50%]  (Sampling) (Chain 0)
Iteration: 20000 / 25000 [ 80%]  (Sampling) (Chain 1)
Iteration: 20000 / 25000 [ 80%]  (Sampling) (Chain 2)
Iteration: 15000 / 25000 [ 60%]  (Sampling) (Chain 0)
Iteration: 17500 / 25000 [ 70%]  (Sampling) (Chain 3)
Iteration: 22500 / 25000 [ 90%]  (Sampling) (Chain 1)
Iteration: 22500 / 25000 [ 90%]  (Sampling) (Chain 2)
Iteration: 17500 / 25000 [ 70%]  (Sampling) (Chain 0)
Iteration: 25000 / 25000 [100%]  (Sampling) (Chain 1)

#  Elapsed Time: 0.608499 seconds (Warm-up)
#                0.684071 seconds (Sampling)
#                1.29257 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling) (Chain 2)

#  Elapsed Time: 0.620206 seconds (Warm-up)
#                0.72414 seconds (Sampling)
#                1.34435 seconds (Total)

Iteration: 20000 / 25000 [ 80%]  (Sampling) (Chain 3)
Iteration: 20000 / 25000 [ 80%]  (Sampling) (Chain 0)
Iteration: 22500 / 25000 [ 90%]  (Sampling) (Chain 3)
Iteration: 22500 / 25000 [ 90%]  (Sampling) (Chain 0)
Iteration: 25000 / 25000 [100%]  (Sampling) (Chain 3)

#  Elapsed Time: 0.610114 seconds (Warm-up)
#                0.657225 seconds (Sampling)
#                1.26734 seconds (Total)

Iteration: 25000 / 25000 [100%]  (Sampling) (Chain 0)

#  Elapsed Time: 0.616008 seconds (Warm-up)
#                0.687802 seconds (Sampling)
#                1.30381 seconds (Total)

可以看到，运行100,000样本约6秒。另外，在我的电脑上，模型运行前有约20秒准备时间。

In [17]:

# load the results from file; plot as above
pystan_trace = np.load('pystan_trace.npy')
plot_MCMC_results(xdata, ydata, pystan_trace)

可见，结果和上面都差不多。

总结：结果比较

下面我们画图来比较一下三个结果：

In [18]:

fig, ax = plt.subplots(figsize=(8, 8))
plot_MCMC_trace(ax, xdata, ydata, emcee_trace, True,
                colors='blue', linewidths=2)
plot_MCMC_trace(ax, xdata, ydata, pymc_trace,
                colors='red', linewidths=2)
plot_MCMC_trace(ax, xdata, ydata, pystan_trace,
                colors='green', linewidths=2)
ax.legend(ax.collections[::2], ['emcee', 'pymc', 'pystan'], fontsize=16);

如我所料，结果都差不多！这表明我们定义的模型在这三个包都一致。另外，我们看到用来产生分布的“真”值(25, 0.5)落在1-

三个包比较

哪个包好？因地制宜。下表是我自己的总结：

	复杂性	执行时间(100,000样本，包括准备阶段)	安装简易性	使用友好程度	特性丰富程度
emcee	轻量版	~6秒	纯Python；pip安装简单	直截了当 & Pythonic	除了MCMC取样没太多功能
pymc2	大量功能 & 选项	~17秒	要求fortran编译器; 可通过conda编译	Pythonic，但有大量的pymc引用文件	大量的内建Python功能
pystan	很多包; 要求Stan语言编写	~20秒编译 + ~6秒计算	要求C编译器 + Stan安装; 没有编译好的文件	不是纯Python; 必须学习Stan建模语言	大量Stan功能

啰嗦几句：

emcee可谓短小精悍。你要做的所有工作就是Python对数后验，之后emcee就会从分布取样了。因为是纯Python而且没有许多常见的分布(如均匀分布，正态分布等)。我觉得它可能比另两个包慢，但是性能差的并不多。可能是取样算法相对简单，所以跑分不差。

pymc功能全面，一旦你明白了修饰器语法，分布函数，派生变量(derived quantities)，那么还是容易用的。它的性能比其他两个都差：同样的查询花更多时间，尽管不同的先验取样都优化过。听说PyMC 3是完全重写，还是值得期待的。

pystan是最难用的，但那是因为不是真Python包。模型也不是Python的，是一种统计语言。这种语言很给力，但是有点难学。约20秒编译时间挺烦人，但我认为随着样本增大模型会更复杂，这点儿可以忽略。Stan是为这种工作设计的，模型都被编译成字节码，在做大case的时候是个不错的选择。

这里就不花太多时间了；前面说过，这些包用了不同的取样算法所以时间是不可比的。每种取样算法都有优缺点，你可以去看看。

希望我的比较是公正合理的。当然我不是这些包的大牛；只是简单读了一些文档，然后把教程的例子改改罢了。如果大牛碰巧读了这些，轻喷哈。再说两件事：

如果你用的更好，写点儿评论哈！
提醒一下那些想用这包解决问题的一般用户：读读教程，因地制宜。

感谢御览！

这篇博客是IPython notebook写的。可以[下载](http://jakevdp.github.io/downloads/notebooks/FreqBayes4.ipynb)，或者看[nbviewer](http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/FreqBayes4.ipynb)静态网页。十分感谢David W. Hogg对本文提出的建议。

菜鸡一枚

frequentism-and-bayesianism-chs-iv

frequentism-and-bayesianism-chs-iv

频率主义与贝叶斯主义 IV：Python的贝叶斯工具

测试案例：线性拟合最优解

数据

模型：斜率，截距和未知相关系数

先验条件

斜率和截距的先验

把先验放一起

用Python解决问题

Emcee

PyMC

PyStan

总结：结果比较

三个包比较

公告

菜鸡一枚

frequentism-and-bayesianism-chs-iv

frequentism-and-bayesianism-chs-iv

频率主义与贝叶斯主义 IV：Python的贝叶斯工具

测试案例：线性拟合最优解

数据

模型：斜率，截距和未知相关系数

先验条件

斜率和截距的先验

σ的先验

把先验放一起

用Python解决问题

Emcee

PyMC

PyStan

总结：结果比较

三个包比较

公告