Plotting Histograms and Computing Kernel Density Estimates in Python

This article uses the following libraries:

from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_density import KDEMultivariate
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm
import seaborn as sns

Plotting histograms

Both of Python's data visualization libraries, matplotlib and seaborn, can plot histograms.

def PlotHist():
    """Plot histograms with matplotlib and seaborn."""
    N = 1000
    # Two-component Gaussian mixture: 30% from N(0, 1), 70% from N(5, 1).
    X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                        np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]
    fig = plt.figure()
    # Panel 1: matplotlib histogram (raw counts).
    ax1 = fig.add_subplot(231)
    ax1.hist(X[:, 0], bins=10, label='plt.hist')
    ax1.legend(loc='best')
    # Panel 2: seaborn histogram without a KDE curve, not normalized.
    # Note: sns.distplot is deprecated since seaborn 0.11.
    ax2 = fig.add_subplot(232)
    sns.distplot(X[:, 0], bins=10, kde=False, norm_hist=False,
                 label='sns.distplot, kde=False', ax=ax2)
    ax2.legend(loc='best')
    # Panel 3: seaborn histogram with a KDE curve drawn on top.
    ax3 = fig.add_subplot(233)
    sns.distplot(X[:, 0], bins=10, kde=True,
                 label='sns.distplot, kde=True', ax=ax3)
    ax3.legend(loc='best')
    plt.show()

The output looks like this:

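Points to note when plotting histograms:

1. bins: the number of bins.

2. kde: whether to overlay a kernel density estimate. With kde=False no KDE curve is drawn; when a curve is drawn, only the Gaussian kernel is available. Compare panels 2 and 3 above.

3. norm_hist: whether to normalize the histogram so that bar heights show a density instead of counts; this is implied when kde=True. Compare panels 2 and 3 above.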
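
To make the effect of bins and normalization concrete without any plotting, here is a minimal sketch based on np.histogram. The sample, the seed, and the variable names are illustrative assumptions rather than part of the example above.

```python
import numpy as np

# Illustrative data: a 30%/70% two-component mixture with a fixed seed (assumption).
rng = np.random.default_rng(0)
N = 1000
x = np.concatenate((rng.normal(0, 1, int(0.3 * N)),
                    rng.normal(5, 1, int(0.7 * N))))

# Raw counts per bin, which is what an unnormalized histogram shows.
counts, edges = np.histogram(x, bins=10)

# Normalized heights: count / (n * bin width), which is what a normalized histogram shows.
density, _ = np.histogram(x, bins=10, density=True)
widths = np.diff(edges)

print(counts.sum())              # total number of samples
print((density * widths).sum())  # bar areas add up to 1, up to rounding
```

Normalization changes only the scale of the vertical axis, not the shape of the histogram, which is why a KDE curve can be drawn on the same axes as a normalized histogram.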

Reference: https://blog.csdn.net/jinruoyanxu/article/details/53390943

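Kernel density estimation

Kernel density estimation (KDE) is used in probability theory to estimate an unknown density function and is a nonparametric estimation method. Many explanations of KDE are available online, so this article focuses on its implementation in Python.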
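
Before turning to the libraries, it may help to see that the estimator itself is short. The sketch below implements the standard definition f_hat(x) = 1 / (n * h) * sum_i K((x - x_i) / h) with a Gaussian kernel K in plain NumPy. The function name gaussian_kde_manual, the sample, and the bandwidth h = 0.5 are illustrative assumptions.

```python
import numpy as np

def gaussian_kde_manual(data, grid, h):
    """Evaluate a Gaussian KDE with bandwidth h at every point of grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distances between every grid point and every sample: shape (len(grid), n).
    u = (grid[:, None] - data[None, :]) / h
    # Standard normal kernel K(u).
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    # Sum the kernels over the samples and rescale by 1 / (n * h).
    return k.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(1)  # fixed seed, illustrative only
sample = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 70)))
grid = np.linspace(-5, 10, 1000)
dens = gaussian_kde_manual(sample, grid, h=0.5)

# A density estimate should integrate to about 1 over a wide enough grid.
area = dens.sum() * (grid[1] - grid[0])
print(area)
```

This version builds a full (grid points x samples) matrix, so it is only practical for small inputs; the libraries discussed next are the better choice for real data.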

Python libraries that compute kernel density estimates include SciPy, Statsmodels, and Scikit-Learn.

Their signatures are as follows:

1. class scipy.stats.gaussian_kde(dataset, bw_method=None, weights=None)

2. class sklearn.neighbors.KernelDensity(bandwidth=1.0, algorithm='auto', kernel='gaussian', metric='euclidean', atol=0, rtol=0, breadth_first=True, leaf_size=40, metric_params=None)

3. statsmodels.nonparametric.kde.KDEUnivariate(endog).fit([kernel, bw, fft, weights, gridsize, …])
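
As a rough illustration of how the three interfaces differ in practice, the sketch below fits a Gaussian KDE to the same sample with each library and evaluates it on a shared grid. The sample, the grid, and the kernel width h = 0.5 are assumptions chosen for illustration. Note that a scalar bw_method in SciPy is a factor that multiplies the sample standard deviation, so it has to be rescaled to match a given kernel width.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.neighbors import KernelDensity
from statsmodels.nonparametric.kde import KDEUnivariate

rng = np.random.default_rng(1)  # fixed seed, illustrative only
x = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 70)))
x_grid = np.linspace(-5, 10, 500)
h = 0.5  # common kernel width (assumed value)

# SciPy: a scalar bw_method scales the sample standard deviation,
# so divide h by that standard deviation to get a kernel width of h.
dens_scipy = gaussian_kde(x, bw_method=h / x.std(ddof=1))(x_grid)

# Scikit-Learn: expects 2-D input and returns the log-density.
skl = KernelDensity(kernel='gaussian', bandwidth=h).fit(x[:, np.newaxis])
dens_sklearn = np.exp(skl.score_samples(x_grid[:, np.newaxis]))

# Statsmodels: the bandwidth is passed to fit(), not to the constructor.
sm_kde = KDEUnivariate(x)
sm_kde.fit(kernel='gau', bw=h, fft=False)
dens_statsmodels = sm_kde.evaluate(x_grid)

# Largest pointwise differences between the three estimates.
print(np.abs(dens_scipy - dens_sklearn).max())
print(np.abs(dens_statsmodels - dens_sklearn).max())
```

With the same kernel and the same width, the three estimates should differ only by floating-point noise; the main practical differences are the input shapes and where the bandwidth is specified.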

The kernel parameter

Available kernels:

1. scipy.stats.gaussian_kde: supports only the Gaussian kernel.

2. Statsmodels

from statsmodels.nonparametric.kde import kernel_switch
list(kernel_switch.keys())

['gau', 'epa', 'uni', 'tri', 'biw', 'triw', 'cos', 'cos2']

3. Scikit-Learn: the available kernels are 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', and 'cosine'.
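
To see the Scikit-Learn kernels side by side, the following sketch fits KernelDensity once per kernel name and checks that each estimate integrates to roughly 1. The sample, the evaluation grid, and the bandwidth of 0.5 are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)  # fixed seed, illustrative only
X = np.concatenate((rng.normal(0, 1, 30),
                    rng.normal(5, 1, 70)))[:, np.newaxis]
X_grid = np.linspace(-10, 15, 5000)[:, np.newaxis]
dx = X_grid[1, 0] - X_grid[0, 0]

kernels = ['gaussian', 'tophat', 'epanechnikov',
           'exponential', 'linear', 'cosine']
areas = {}
for kernel in kernels:
    kde = KernelDensity(kernel=kernel, bandwidth=0.5).fit(X)
    dens = np.exp(kde.score_samples(X_grid))  # score_samples returns the log-density
    areas[kernel] = dens.sum() * dx           # Riemann-sum approximation of the integral
    print(kernel, round(areas[kernel], 3))
```

All six kernels produce valid densities; they differ in smoothness and in whether they have bounded support, which matters mostly for small samples.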

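Bandwidth selection

There are two approaches to choosing bw. The first comes from statistics and uses rules of thumb derived from prior assumptions, such as Silverman's rule. The second comes from machine learning and uses cross-validation.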
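
Both approaches can be sketched in a few lines. The example below computes Silverman's rule of thumb by hand and then selects a bandwidth by cross-validated log-likelihood with Scikit-Learn's GridSearchCV. The sample, the candidate grid, and the 5-fold shuffled split are assumptions made for illustration, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)  # fixed seed, illustrative only
x = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 70)))

# Rule of thumb (Silverman): h = 0.9 * min(std, IQR / 1.34) * n^(-1/5).
q75, q25 = np.percentile(x, [75, 25])
spread = min(x.std(ddof=1), (q75 - q25) / 1.34)
bw_silverman = 0.9 * spread * len(x) ** (-0.2)

# Cross-validation: pick the bandwidth with the highest held-out log-likelihood.
# KernelDensity.score returns the total log-likelihood, which GridSearchCV maximizes.
candidates = np.linspace(0.1, 1.5, 29)  # search grid is an assumption
folds = KFold(n_splits=5, shuffle=True, random_state=0)  # shuffle: the sample is ordered by component
search = GridSearchCV(KernelDensity(kernel='gaussian'),
                      {'bandwidth': candidates}, cv=folds)
search.fit(x[:, np.newaxis])
bw_cv = search.best_params_['bandwidth']

print(bw_silverman, bw_cv)
```

Silverman's rule assumes a roughly unimodal, near-normal sample, so on bimodal data like this it tends to oversmooth; cross-validation makes no such assumption but costs one model fit per candidate and fold.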

Comparing different bw values in scikit-learn:

def PlotBandWidths():
    '''Plot scikit-learn KDEs for several bandwidths.'''
    N = 100
    np.random.seed(1)
    # Two-component Gaussian mixture: 30% from N(0, 1), 70% from N(5, 1).
    X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                        np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]

    X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]

    # True density of the mixture, for comparison with the estimates.
    true_dens = (0.3 * norm(0, 1).pdf(X_plot[:, 0])
                 + 0.7 * norm(5, 1).pdf(X_plot[:, 0]))

    fig, ax = plt.subplots()
    ax.fill(X_plot[:, 0], true_dens, fc='black', alpha=0.2,
            label='input distribution')
    colors = ['navy', 'cornflowerblue', 'darkorange']
    bws = [0.1, 0.5, 1]
    lw = 2

    for color, bw in zip(colors, bws):
        kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
        # score_samples returns the log-density, so exponentiate before plotting.
        log_dens = kde.score_samples(X_plot)
        ax.plot(X_plot[:, 0], np.exp(log_dens), color=color, lw=lw,
                linestyle='-', label="bandwidth = '{0}'".format(bw))

    ax.text(6, 0.38, "N={0} points".format(N))

    ax.legend(loc='upper left')
    # Mark the samples just below the x-axis, with vertical jitter.
    ax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), '+k')

    ax.set_xlim(-4, 9)
    ax.set_ylim(-0.02, 0.4)