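Python: Exploring the bins and rwidth Parameters of Matplotlib Histograms

1 Setting the Scene

In machine learning projects we often need to analyze how the samples of a dataset are distributed, and that calls for plotting a histogram.

In Python this is easy to do with the hist function of matplotlib.pyplot. The function has quite a few parameters, though, and a few plotting details deserve attention.

Suppose we are working on a federated learning project. We have an image dataset of 15 samples with 3 labels: cat, dog, and car. The dataset has already been split unevenly across 4 clients, as shown below: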

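import matplotlib.pyplot as plt
import numpy as np

n_clients = 4
classes = ["cat", "dog", "car"]
# Which clients the samples of each label were assigned to:
# each value is a list of client IDs, one entry per sample
label_to_cids = {"cat": [0, 0, 2, 3],
                 "dog": [0, 1, 1, 2, 3, 3, 3],
                 "car": [0, 2, 2, 3]}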

If we want to visualize how the cat samples are distributed across the clients, we can write the following code:

plt.figure(figsize=(5, 3))
plt.hist(label_to_cids["cat"], stacked=False, bins=n_clients, label="cat")

plt.xticks(np.arange(n_clients), [
           "Client {}".format(i) for i in range(n_clients)])
plt.legend()
plt.show()

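The resulting plot looks like this:

In this example, hist takes as input the list of client IDs that the cat samples belong to. Because bins is an integer, hist splits the data range [0, 3] into 4 equal-width bins^[1] (with edges [0, 0.75, 1.5, 2.25, 3]), so each of the 4 client IDs falls into its own bin, and then counts the samples in each bin.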
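To see which bin edges an integer bins value produces, we can ask NumPy directly. This is a minimal sketch that relies on the documented fact that hist uses numpy.histogram for binning; the variable name cat_cids is only illustrative.

```python
import numpy as np

# Client IDs of the "cat" samples, i.e. label_to_cids["cat"]
cat_cids = [0, 0, 2, 3]

# With an integer bins, the data range [min, max] = [0, 3] is cut
# into that many equal-width bins.
counts, edges = np.histogram(cat_cids, bins=4)

print(edges.tolist())   # [0.0, 0.75, 1.5, 2.25, 3.0]
print(counts.tolist())  # [2, 0, 1, 1]
```

The bin centers are 0.375, 1.125, 1.875 and 2.625 rather than 0, 1, 2, 3, which is why tick labels placed at the integer client IDs drift away from the bars.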

To visualize how the samples of all classes are distributed across the clients, we change the code to:

plt.figure(figsize=(5, 3))
plt.hist(label_to_cids.values(), stacked=False, bins=n_clients, label=classes)

plt.xticks(np.arange(n_clients), [
           "Client {}".format(i) for i in range(n_clients)])
plt.legend()
plt.show()

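The resulting plot looks like this:

This example differs slightly from the previous one: the input is no longer a single list but a collection of lists of varying length, each describing the distribution of one class. The principle is the same as in the first example: the client ID list of each class is binned into 4 bins in turn (the differently colored bars in the plot correspond to the different classes).

Now we notice that the tick labels on the x-axis are not aligned with the bars above them (in the first example each bin holds 1 bar; in the second, each bin holds 3 bars, one per class). To fix this we need to adjust the bins parameter.

2 The bins parameter

Before discussing the bins parameter, let us get familiar with what bin and bar mean in a hist plot. The diagram below illustrates them:

Here \(x_1\) and \(x_2\) are the objects on the x-axis. In our plots the first object sits at tick 0, the second at tick 1, and so on. In the diagram, a bin is the rectangular plotting region that belongs to each x-axis object, and a bar is a bar drawn inside that region. As the diagram shows, the bin of the first object is the interval [-0.5, 0.5) and the bin of the second object is [0.5, 1.5). (Note that hist treats every bin as closed on the left and open on the right, except the last bin, which also includes its right edge.) Each object's bin contains 3 bars.

Readers who have taken a course on randomized algorithms^[2] will recognize this as the balls-into-bins model. We treat the elements of the input array as balls and the bins as, well, bins; what we visualize is how the balls are distributed over the bins, that is, the number of balls in each bin.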
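The balls-into-bins view can be checked without drawing anything. The sketch below is an illustration, not part of the plotting code: it uses np.bincount, which works here only because the client IDs are small non-negative integers, and the name balls_per_bin is made up for this example.

```python
import numpy as np

n_clients = 4
label_to_cids = {"cat": [0, 0, 2, 3],
                 "dog": [0, 1, 1, 2, 3, 3, 3],
                 "car": [0, 2, 2, 3]}

# Each client ID in a list is a "ball"; each client is a "bin".
# np.bincount counts how many balls landed in each bin.
balls_per_bin = {label: np.bincount(cids, minlength=n_clients).tolist()
                 for label, cids in label_to_cids.items()}

for label, counts in balls_per_bin.items():
    print(label, counts)
# cat [2, 0, 1, 1]
# dog [1, 2, 1, 3]
# car [1, 0, 2, 1]
```

These counts are the bar heights we expect a correctly binned histogram to show for each class.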

The matplotlib documentation^[3] describes the bins parameter as follows:

bins: int or sequence or str, default: rcParams["hist.bins"] (default: 10)
If bins is an integer, it defines the number of equal-width bins in the range.
If bins is a sequence, it defines the bin edges, including the left edge of the first bin and the right edge of the last bin; in this case, bins may be unequally spaced. All but the last (righthand-most) bin is half-open. In other words, if bins is:
[1, 2, 3, 4]
then the first bin is [1, 2) (including 1, but excluding 2) and the second [2, 3). The last bin, however, is [3, 4], which includes 4.
If bins is a string, it is one of the binning strategies supported by numpy.histogram_bin_edges: 'auto', 'fd', 'doane', 'scott', 'stone', 'rice', 'sturges', or 'sqrt'.
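The half-open rule in the quoted passage is easy to verify with numpy.histogram, which hist uses for binning. The following sketch reuses the documentation's own edges [1, 2, 3, 4]; the sample values are chosen here so that each one sits exactly on an edge.

```python
import numpy as np

edges = [1, 2, 3, 4]
values = [1, 2, 3, 4]  # every value lies exactly on a bin edge

counts, _ = np.histogram(values, bins=edges)

# 1 -> [1, 2), 2 -> [2, 3), and both 3 and 4 -> [3, 4],
# because only the last bin includes its right edge.
print(counts.tolist())  # [1, 1, 2]
```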

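To summarize: if bins is a number, it sets the number of bins, that is, how many separate plotting regions the x-axis is divided into. We have four clients, so we need 4 regions, and the offset of each region relative to the x-axis ticks is left at its default.

If we want to control where each region sits, however, we need to set bins to a sequence.

The values in the bins sequence are expressed in the x-coordinates of the hist plot; in our task the 4 clients sit at x-axis ticks [0, 1, 2, 3]. Setting the sequence to [0, 1, 2, 3, 4] means that the first region covers [0, 1), the second [1, 2), the third [2, 3), and the last one [3, 4].

For the sake of appearance, we want the center of each region to line up with its x-axis tick. The first region should then be [-0.5, 0.5), the second [0.5, 1.5), and so on, giving the bins sequence [-0.5, 0.5, 1.5, 2.5, 3.5]. We therefore set the bins parameter of hist to np.arange(-0.5, 4, 1):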

plt.hist(label_to_cids.values(), stacked=False,
         bins=np.arange(-0.5, 4, 1), label=classes)

Now each region is aligned with its tick on the x-axis:
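Putting the pieces together, a self-contained version of the aligned plot might look like the sketch below. Several choices here are assumptions of this example rather than part of the original snippets: the Agg backend (so the script runs without a display), the object-oriented fig/ax style, and wrapping the dictionary values in list().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

n_clients = 4
classes = ["cat", "dog", "car"]
label_to_cids = {"cat": [0, 0, 2, 3],
                 "dog": [0, 1, 1, 2, 3, 3, 3],
                 "car": [0, 2, 2, 3]}

# Edges half a unit to either side of each client ID,
# so every bin is centered on its tick.
bin_edges = np.arange(-0.5, n_clients, 1)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

fig, ax = plt.subplots(figsize=(5, 3))
ax.hist(list(label_to_cids.values()), stacked=False,
        bins=bin_edges, label=classes)
ax.set_xticks(bin_centers)
ax.set_xticklabels([f"Client {i}" for i in range(n_clients)])
ax.legend()

print(bin_edges.tolist())    # [-0.5, 0.5, 1.5, 2.5, 3.5]
print(bin_centers.tolist())  # [0.0, 1.0, 2.0, 3.0]
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show() at the end to display the figure.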

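3 The stacked parameter

When the x-axis has many items, drawing 3 bars for every one of them takes up a lot of plotting space. How can we use the space more economically? This is where the stacked parameter comes in handy. We set stacked to True: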

plt.hist(label_to_cids.values(), stacked=True,
         bins=np.arange(-0.5, 4, 1), label=classes)

As we can see, the bars of each x-axis object are now stacked on top of one another:
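To make the stacking concrete, the heights can be reproduced with NumPy alone. This is a sketch of the arithmetic behind a stacked histogram, not a look inside matplotlib; the names per_class and stacked_tops are made up for this example.

```python
import numpy as np

label_to_cids = {"cat": [0, 0, 2, 3],
                 "dog": [0, 1, 1, 2, 3, 3, 3],
                 "car": [0, 2, 2, 3]}
bin_edges = np.arange(-0.5, 4, 1)

# One row of counts per class, one column per client.
per_class = np.array([np.histogram(cids, bins=bin_edges)[0]
                      for cids in label_to_cids.values()])

# In a stacked plot each class sits on top of the previous ones,
# so the top of each segment is a running sum over the classes.
stacked_tops = per_class.cumsum(axis=0)

print(per_class.tolist())         # [[2, 0, 1, 1], [1, 2, 1, 3], [1, 0, 2, 1]]
print(stacked_tops[-1].tolist())  # [4, 2, 4, 5]
```

The last row is the total number of samples on each client, and the four totals add up to the 15 samples in the dataset.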

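A new problem appears, though: there is no gap at all between the bars of neighboring x-axis objects, which looks crowded (in fact, we already ran into this in Section 1 when visualizing the distribution of the cat samples across clients). Can we change the bins parameter to put some space between the bins? No: as noted earlier, a bins sequence defines consecutive bin edges, so the bins are always contiguous.

Instead, we can add space between the bar inside each bin and the edges of that bin. For this we need the rwidth parameter.

4 The rwidth parameter

Here is how the documentation describes the rwidth parameter: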

rwidth: float or None, default: None
The relative width of the bars as a fraction of the bin width. If None, automatically compute the width.
Ignored if histtype is 'step' or 'stepfilled'.

In other words, rwidth sets the width of the bar in each bin relative to the width of the bin. Let us set it to 0.5:

plt.hist(label_to_cids.values(), stacked=True,
         bins=np.arange(-0.5, 4, 1), rwidth=0.5, label=classes)

The plot after the change:

As we can see, the bar of each x-axis object takes up exactly half the width of its bin.
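The effect of rwidth can also be checked numerically by inspecting the rectangles that hist returns. The sketch below uses only the cat samples to keep things small; the Agg backend and the variable names are assumptions of this example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

cat_cids = [0, 0, 2, 3]
bin_edges = np.arange(-0.5, 4, 1)

fig, ax = plt.subplots(figsize=(5, 3))
# hist returns (counts, edges, bars); bars holds one rectangle per bin.
_, _, bars = ax.hist(cat_cids, bins=bin_edges, rwidth=0.5, label="cat")

bin_width = bin_edges[1] - bin_edges[0]
bar_widths = [bar.get_width() for bar in bars]

print(bin_width)   # 1.0
print(bar_widths)  # each bar is half as wide as its bin
```

With rwidth=0.5 and bins of width 1, every rectangle should come out 0.5 wide, leaving a quarter of the bin empty on each side of the bar.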

References

[1] McKinney W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd edition[M]. O'Reilly Media, Inc., 2018.
[2] Mitzenmacher M, Upfal E. Probability and Computing: Randomization and Probabilistic Techniques in Algorithms and Data Analysis, 2nd edition[M]. Cambridge University Press, 2017.
[3] Matplotlib: matplotlib.pyplot.hist

posted @ 2022-02-03 19:10 orion-orion
