Python Methods for Plotting Histograms and Computing Kernel Density Estimates
The libraries used in this article:
from sklearn.neighbors import KernelDensity
from scipy.stats import gaussian_kde
from statsmodels.nonparametric.kde import KDEUnivariate
from statsmodels.nonparametric.kernel_density import KDEMultivariate
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.distributions import norm
import seaborn as sns
Plotting Histograms
Both of Python's data visualization libraries, matplotlib and seaborn, can plot histograms.
def PlotHist():
    """Plot histograms with matplotlib and seaborn."""
    N = 1000
    # Mixture of two Gaussians: 30% from N(0, 1) and 70% from N(5, 1)
    X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                        np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]
    fig = plt.figure()
    # Panel 1: plain matplotlib histogram
    ax1 = fig.add_subplot(231)
    ax1.hist(X[:, 0], bins=10, label='plt.hist')
    ax1.legend(loc='best')
    # Panel 2: seaborn histogram without a KDE curve
    ax2 = fig.add_subplot(232)
    sns.distplot(X[:, 0], bins=10, kde=False, norm_hist=False, label='sns.distplot')
    ax2.legend(loc='best')
    # Panel 3: seaborn histogram with a KDE curve overlaid
    ax3 = fig.add_subplot(233)
    sns.distplot(X[:, 0], bins=10, kde=True, label='sns.distplot + kde')
    ax3.legend(loc='best')
    plt.show()
Running it produces a three-panel figure: plt.hist, sns.distplot without a KDE curve, and sns.distplot with one.
Things to note when plotting histograms:
1. bins: the number of bins;
2. kde: whether to overlay a kernel density estimate; with False no KDE curve is drawn, and only the Gaussian kernel is available, as in panels 2 and 3 above;
3. norm_hist: whether to normalize the histogram, as in panels 2 and 3 above.
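Note that in newer seaborn releases (0.11 and later) distplot is deprecated in favor of histplot. A minimal sketch of the equivalent call, assuming seaborn >= 0.11 and reusing the sample X from PlotHist above:

# Rough equivalent of sns.distplot(X[:, 0], bins=10, kde=True);
# stat='density' normalizes the bars the way distplot did
sns.histplot(X[:, 0], bins=10, kde=True, stat='density')
plt.show()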
Reference: https://blog.csdn.net/jinruoyanxu/article/details/53390943
Kernel Density Estimation
Kernel Density Estimation (KDE) is a non-parametric method from probability theory for estimating an unknown density function. Many explanations of KDE are available online, so this article focuses on how to compute it in Python.
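For reference, the standard kernel density estimator for a sample x_1, ..., x_n is

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)

where K is the kernel function and h > 0 is the bandwidth.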
Python libraries that implement kernel density estimation: SciPy, Statsmodels, and Scikit-Learn.
Their signatures are as follows:
1. class scipy.stats.gaussian_kde(dataset, bw_method=None, weights=None)
2. class sklearn.neighbors.KernelDensity(bandwidth=1.0, algorithm='auto', kernel='gaussian', metric='euclidean', atol=0, rtol=0, breadth_first=True, leaf_size=40, metric_params=None)
3. statsmodels.nonparametric.kde.KDEUnivariate(endog).fit([kernel, bw, fft, weights, gridsize, …])
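A minimal sketch evaluating all three on the same sample (the sample, the grid, and the bandwidth 0.5 are illustrative assumptions, not recommendations):

x = np.random.normal(0, 1, 200)
grid = np.linspace(-4, 4, 100)

# SciPy: bandwidth chosen automatically (Scott's rule by default);
# the returned object is callable on evaluation points
dens_scipy = gaussian_kde(x)(grid)

# Statsmodels: fit first, then evaluate on the grid
kde_sm = KDEUnivariate(x)
kde_sm.fit(kernel='gau', bw='normal_reference', fft=True)
dens_sm = kde_sm.evaluate(grid)

# Scikit-Learn: expects 2-D arrays and returns log densities
kde_skl = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(x[:, np.newaxis])
dens_skl = np.exp(kde_skl.score_samples(grid[:, np.newaxis]))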
The kernel parameter (choice of kernel function)
Available kernel functions:
1. scipy.stats.gaussian_kde: only the Gaussian kernel;
2. Statsmodels:

from statsmodels.nonparametric.kde import kernel_switch
list(kernel_switch.keys())
# ['gau', 'epa', 'uni', 'tri', 'biw', 'triw', 'cos', 'cos2']
3. Scikit-Learn: 'gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine' (their shapes are sketched below).
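A quick way to visualize the scikit-learn kernel shapes is to fit KernelDensity on a single point at the origin and evaluate on a grid; a minimal sketch:

# Fitting on one point at the origin makes the KDE trace the kernel itself
X_src = np.zeros((1, 1))
x_grid = np.linspace(-3, 3, 200)[:, np.newaxis]
for kernel in ['gaussian', 'tophat', 'epanechnikov',
               'exponential', 'linear', 'cosine']:
    kde = KernelDensity(kernel=kernel, bandwidth=1.0).fit(X_src)
    plt.plot(x_grid[:, 0], np.exp(kde.score_samples(x_grid)), label=kernel)
plt.legend()
plt.show()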
Choosing the bandwidth
There are two broad approaches to choosing bw: reference rules from statistics based on prior assumptions, such as Silverman's rule, and cross-validation from machine learning.
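Both approaches are straightforward in practice. SciPy applies a reference rule via bw_method, and scikit-learn's bandwidth can be tuned with GridSearchCV because KernelDensity.score returns the total log-likelihood of held-out data. A minimal sketch (the sample, the bandwidth grid, and cv=5 are illustrative assumptions):

from sklearn.model_selection import GridSearchCV

x = np.random.normal(0, 1, 200)

# Reference rule: SciPy's gaussian_kde accepts 'scott' or 'silverman'
kde_rule = gaussian_kde(x, bw_method='silverman')

# Cross-validation: search a bandwidth grid for scikit-learn's KernelDensity
params = {'bandwidth': np.linspace(0.1, 1.0, 10)}
search = GridSearchCV(KernelDensity(kernel='gaussian'), params, cv=5)
search.fit(x[:, np.newaxis])
print(search.best_params_['bandwidth'])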
Comparing different bw values in scikit-learn:
def PlotBandWidths():
    """Plot KDEs with different bandwidths using scikit-learn."""
    N = 100
    np.random.seed(1)
    # Mixture of two Gaussians: 30% from N(0, 1) and 70% from N(5, 1)
    X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                        np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]

    X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]

    # The true mixture density, for reference
    true_dens = (0.3 * norm(0, 1).pdf(X_plot[:, 0])
                 + 0.7 * norm(5, 1).pdf(X_plot[:, 0]))

    fig, ax = plt.subplots()
    ax.fill(X_plot[:, 0], true_dens, fc='black', alpha=0.2,
            label='input distribution')
    colors = ['navy', 'cornflowerblue', 'darkorange']
    bws = [0.1, 0.5, 1]
    lw = 2

    for color, bw in zip(colors, bws):
        kde = KernelDensity(kernel='gaussian', bandwidth=bw).fit(X)
        # score_samples returns log densities; exponentiate to plot them
        log_dens = kde.score_samples(X_plot)
        ax.plot(X_plot[:, 0], np.exp(log_dens), color=color, lw=lw,
                linestyle='-', label="bandwidth = {0}".format(bw))

    ax.text(6, 0.38, "N={0} points".format(N))

    ax.legend(loc='upper left')
    # Rug plot of the sample points below the x-axis
    ax.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), '+k')

    ax.set_xlim(-4, 9)
    ax.set_ylim(-0.02, 0.4)
    plt.show()
Running it plots the true mixture density together with Gaussian-KDE curves for bandwidths 0.1, 0.5, and 1.
For a comparison of the three implementations, see: http://www.pythontip.com/blog/post/8988/
For usage details, see the official documentation:
1. Scikit-learn: http://lijiancheng0614.github.io/scikit-learn/auto_examples/neighbors/plot_species_kde.html
2. SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html
3. Statsmodels: https://www.statsmodels.org/stable/examples/notebooks/generated/kernel_density.html?highlight=statsmodels%20nonparametric%20kde
4. Seaborn: http://seaborn.pydata.org/generated/seaborn.kdeplot.html#seaborn.kdeplot