[Scikit-learn] 2.1 Clustering - Variational Bayesian Gaussian Mixture

The key question: why does the Bayesian GMM fit better?

PRML explains this in its treatment of the variational mixture of Gaussians (§10.2).

 

Ref: http://freemind.pluskid.org/machine-learning/deciding-the-number-of-clusterings/ 

The linked article also covers several other unsupervised clustering methods.

 

From: http://scikit-learn.org/stable/modules/mixture.html#variational-bayesian-gaussian-mixture

Due to its Bayesian nature, the variational algorithm needs more hyperparameters than expectation-maximization, the most important of these being the concentration parameter weight_concentration_prior.

  • Specifying a low value for the concentration prior makes the model put most of the weight on a few components and set the remaining components' weights very close to zero.
  • High values of the concentration prior allow a larger number of components to be active in the mixture. (A minimal sketch of this effect follows below.)
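To see this concretely, here is a minimal sketch (the toy data and variable names are my own assumptions, not from the scikit-learn docs): fit the same Bayesian GMM with a very low and a very high concentration prior and compare the learned weights.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# hypothetical 2-cluster toy data, for illustration only
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2),
               rng.randn(200, 2) + [6, 6]])

for prior in (0.01, 1000.0):              # low vs high concentration prior
    bgmm = BayesianGaussianMixture(
        n_components=5,
        weight_concentration_prior=prior,
        random_state=0,
    ).fit(X)
    # low prior  -> most weight on a few components, the rest near zero
    # high prior -> more components stay active
    print(prior, np.round(bgmm.weights_, 3))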

 

The examples below compare Gaussian mixture models with a fixed number of components to variational Gaussian mixture models with a Dirichlet process prior. Here, a classical Gaussian mixture is fitted with 5 components on a dataset composed of 2 clusters.

We can see that the variational Gaussian mixture with a Dirichlet process prior is able to limit itself to only 2 components, whereas the Gaussian mixture fits the data with a fixed number of components that has to be set a priori by the user. In this case the user has selected n_components=5, which does not match the true generative distribution of this toy dataset. Note that with very few observations, the variational Gaussian mixture models with a Dirichlet process prior can take a conservative stand and fit only one component.

 

The Dirichlet prior effectively performs automatic component selection, singling out the components that do the real work:

5 for GMM
[ 0.1258077   0.23638361  0.23330578  0.26361639  0.14088652]
5 for Bayesian GMM
[ 0.001019    0.00101796  0.49948856  0.47955123  0.01892325]
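These numbers are the fitted models' weights_ attributes. A minimal sketch of how such a comparison is produced (the toy data below is an assumption, so the printed numbers will differ from the ones above):

import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# hypothetical 2-cluster data standing in for the post's unseen dataset
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [6, 6]])

gmm = GaussianMixture(n_components=5, random_state=1).fit(X)
bgmm = BayesianGaussianMixture(n_components=5, random_state=1).fit(X)

print('5 for GMM')
print(gmm.weights_)      # all five components keep non-trivial weight
print('5 for Bayesian GMM')
print(bgmm.weights_)     # weight concentrates on roughly two components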

 

This raises a question:

Why does the Dirichlet prior push three of the component weights toward zero, while the plain GMM does not? Could it simply be a difference in convergence speed?

Speed is not the reason. Once the Dirichlet prior is added, the (variational) posterior over the mixing weights is again a Dirichlet, so the parameter estimation inherits the Dirichlet's sparsity-inducing behavior: for small concentration values, probability mass piles onto a few components.
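This "good property" is easy to see by sampling: a Dirichlet with concentration below 1 yields sparse weight vectors, while concentration above 1 spreads the mass almost uniformly. A small numpy sketch:

import numpy as np

rng = np.random.RandomState(0)
# concentration < 1: most mass lands on one or two of the five components
print(np.round(rng.dirichlet([0.1] * 5, size=3), 3))
# concentration > 1: each weight vector stays close to uniform
print(np.round(rng.dirichlet([10.0] * 5, size=3), 3))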

The raw data

Our data set will be the classic Old Faithful dataset.
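A minimal way to obtain data (the filename faithful.csv is hypothetical; any copy of the Old Faithful data with eruptions and waiting columns will do):

import pandas as pd

# 'faithful.csv' is a placeholder path, not from the original post
data = pd.read_csv('faithful.csv')
print(data.head())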

import matplotlib.pyplot as plt

plt.scatter(data['eruptions'], data['waiting'], alpha=0.5);
plt.xlabel('eruptions');
plt.ylabel('waiting');

How do we fit the model?

from sklearn.mixture import BayesianGaussianMixture

mixture_model = BayesianGaussianMixture(
    n_components=10,
    random_state=5,  # control the pseudo-random initialization
    weight_concentration_prior_type='dirichlet_distribution',
    weight_concentration_prior=1.0,  # parameter of the Dirichlet component prior
    max_iter=200,  # choose this to be big in case it takes a long time to fit
)
mixture_model.fit(data);
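Once fitted, the learned mixing weights and the hard cluster assignments can be inspected directly (a usage sketch; the exact numbers depend on the data and the seed):

import numpy as np

print(np.round(mixture_model.weights_, 3))   # many of the 10 weights end up near 0
labels = mixture_model.predict(data)         # hard assignment for each observation
print(np.bincount(labels))                   # how many points each component claims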

 

Ref: http://scikit-learn.org/stable/auto_examples/mixture/plot_concentration_prior.html

The plotting helpers from that example can be used directly. Inside plot_results (defined further below), the component ellipses are drawn with:

plot_ellipses(ax1, model.weights_, model.means_, model.covariances_)

import numpy as np
import matplotlib as mpl

def plot_ellipses(ax, weights, means, covars):
    """
    Given a list of mixture component weights, means, and covariances,
    plot ellipses to show the orientation and scale of the Gaussian
    mixture dispersal.
    """
    for n in range(means.shape[0]):
        # eigendecomposition of the covariance gives the axis directions/lengths
        eig_vals, eig_vecs = np.linalg.eigh(covars[n])
        unit_eig_vec = eig_vecs[0] / np.linalg.norm(eig_vecs[0])
        angle = np.arctan2(unit_eig_vec[1], unit_eig_vec[0])
        angle = 180 * angle / np.pi                     # Ellipse needs degrees
        eig_vals = 2 * np.sqrt(2) * np.sqrt(eig_vals)   # eigenvalue -> axis length
        # filled ellipse whose transparency encodes the component weight
        ell = mpl.patches.Ellipse(means[n], eig_vals[0], eig_vals[1],
                                  angle=180 + angle, edgecolor=None)
        # black outline drawn on top
        ell2 = mpl.patches.Ellipse(means[n], eig_vals[0], eig_vals[1],
                                   angle=180 + angle, edgecolor='black',
                                   fill=False, linewidth=1)
        ell.set_clip_box(ax.bbox)
        ell2.set_clip_box(ax.bbox)
        ell.set_alpha(weights[n])
        ell.set_facecolor('#56B4E9')
        ax.add_artist(ell)
        ax.add_artist(ell2)

 

plot_results(
    mixture_model,
    data['eruptions'], data['waiting'],
    'weight_concentration_prior={}'.format(1.0))


import matplotlib.gridspec as gridspec

def plot_results(model, x, y, title, plot_title=False):
    plt.figure()
    gs = gridspec.GridSpec(3, 1)      # custom layout: scatter on top, weights below

    # ax1: scatter plot overlaid with the mixture ellipses
    ax1 = plt.subplot(gs[0:2, 0])
    ax1.set_title(title)
    ax1.scatter(x, y, s=5, marker='o', alpha=0.8)
    ax1.set_xticks(())
    ax1.set_yticks(())
    n_components = model.get_params()['n_components']
    plot_ellipses(ax1, model.weights_, model.means_, model.covariances_)

    # ax2: bar chart of the component weights
    ax2 = plt.subplot(gs[2, 0])
    ax2.get_xaxis().set_tick_params(direction='out')
    ax2.yaxis.grid(True, alpha=0.7)
    for k, w in enumerate(model.weights_):
        ax2.bar(k, w, width=0.9, color='#56B4E9', zorder=3,
                align='center', edgecolor='black')
        ax2.text(k, w + 0.007, "%.1f%%" % (w * 100.),
                 horizontalalignment='center')
    ax2.set_xlim(-.6, n_components - .4)
    ax2.set_ylim(0., 1.1)
    ax2.tick_params(axis='y', which='both', left=False,
                    right=False, labelleft=False)
    ax2.tick_params(axis='x', which='both', top=False)
    if plot_title:
        ax1.set_ylabel('Estimated Mixtures')
        ax2.set_ylabel('Weight of each component')

 

Inspecting the fitting process:

lower_bounds = []
mixture_model = BayesianGaussianMixture(
    n_components=10,
    covariance_type='full',
    max_iter=1,        # a single EM step per call to fit()
    random_state=2,
    weight_concentration_prior_type='dirichlet_distribution',
    warm_start=True,   # resume from the previous parameters
)

# With max_iter=1 and warm_start=True, each fit() call performs exactly one
# EM step, so the loop below exposes the fitting process step by step.
for i in range(200):
    mixture_model.fit(data)
    if mixture_model.converged_:
        break
    lower_bounds.append(mixture_model.lower_bound_)
    if i % 5 == 0 and i < 60:
        plt.figure()
        plot_results(
            mixture_model,
            data['eruptions'], data['waiting'],
            'EM step={}, lower_bound={}'.format(
                i, mixture_model.lower_bound_))

plt.figure()
plt.plot(lower_bounds)
plt.gca().set_xlabel('step')
plt.gca().set_ylabel('lower bound')

(Figures: snapshots of the fitted mixture at selected EM steps, followed by the lower-bound curve.)

The lower bound increases from step to step.
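Since variational EM never decreases the evidence lower bound, the recorded values can be sanity-checked programmatically (a sketch, allowing a little numerical slack):

# every recorded lower bound should be at least as large as the previous one
assert all(later >= earlier - 1e-8
           for earlier, later in zip(lower_bounds, lower_bounds[1:]))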

 

Comparing the results under different random initializations:

for seed in range(6, 11):
    lower_bounds = []
    mixture_model = BayesianGaussianMixture(
        n_components=10,
        covariance_type='full',
        max_iter=1,
        random_state=seed,
        weight_concentration_prior_type='dirichlet_distribution',
        warm_start=True,
    )
    for i in range(200):
        mixture_model.fit(data)
        if mixture_model.converged_:
            break
        lower_bounds.append(mixture_model.lower_bound_)
    plt.plot(lower_bounds, label='seed={}'.format(seed))  # one curve per seed

plt.gca().set_xlabel('step')
plt.gca().set_ylabel('lower bound')
plt.legend();

Result: one lower-bound curve per seed, showing how the optimization trajectory depends on the initialization.

 
