4-Pandas Data Preprocessing: Discretization and Binning (Equal-Width pd.cut(), Equal-Frequency pd.qcut())
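When working with continuous data, it is often convenient for analysis to discretize it, that is, to split it into "bins" so that each value is assigned to a small interval.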
In Pandas, cut() ---> discretization with equal-width (or user-specified) bins
qcut() --> quantile-based binning (equal frequency)
1. cut(): equal-width discretization, where every interval defined by bins has the same width.
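This section uses the same example as the one on sorting and random shuffling, namely the COVID-19 dataset.
Here we operate on the cumulative confirmed cases column. First check its maximum and minimum to get a sense of how many groups to use; in this case the data is divided into 7 groups. Values that fall outside the bin edges (here, anything above 70000) are returned as NaN, as row 2 of the output shows: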
>>> df['total_confirm'].max()
677146
>>> df['total_confirm'].min()
1
>>> bins = [0,10000,20000,30000,40000,50000,60000,70000]
>>> pd.cut(df['total_confirm'],bins)[:8]
0        (0.0, 10000.0]
1        (0.0, 10000.0]
2                   NaN
3    (10000.0, 20000.0]
4        (0.0, 10000.0]
5        (0.0, 10000.0]
6    (10000.0, 20000.0]
7        (0.0, 10000.0]
Name: total_confirm, dtype: category
Categories (7, interval[int64]): [(0, 10000] < (10000, 20000] < (20000, 30000] < (30000, 40000] < (40000, 50000] < (50000, 60000] < (60000, 70000]]
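As a minimal, self-contained sketch of the behavior above: the Series below is a made-up stand-in for df['total_confirm'] (its values are assumptions, not the real dataset). It contrasts passing explicit edges, where out-of-range values become NaN, with passing an integer, where pandas derives that many equal-width bins from the data's own minimum and maximum.

```python
import pandas as pd

# Hypothetical stand-in for df['total_confirm']; the values are invented.
s = pd.Series([1, 5000, 15000, 42000, 69999, 677146], name="total_confirm")

# Explicit edges 0, 10000, ..., 70000: seven right-closed intervals.
edges = list(range(0, 70001, 10000))
by_edges = pd.cut(s, edges)

# 677146 lies above the last edge, so it is not assigned to any bin.
print(by_edges.isna().tolist())

# An integer asks pandas to split the observed range into 7 equal-width bins,
# so every value is covered (the lowest edge is nudged down slightly so the
# minimum is included).
by_count = pd.cut(s, 7)
print(by_count.cat.categories)
```

With highly skewed data like this, integer bins put almost every value into the first interval, which is one reason to pick the edges by hand as the original example does.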
The labels parameter replaces these intervals with custom strings:
>>> pd.cut(df['total_confirm'],bins=bins,labels=['A','B','C','D','E','F','G'])[:8]
0      A
1      A
2    NaN
3      B
4      A
5      A
6      B
7      A
Name: total_confirm, dtype: category
Categories (7, object): [A < B < C < D < E < F < G]
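The following sketch illustrates the labels parameter on a small invented Series (the data and the four-bin edges are assumptions for illustration). It also shows labels=False, which is not used in the original example: it returns the integer position of each bin instead of an interval or a string.

```python
import pandas as pd

# Invented values; the last one lies outside the edges on purpose.
s = pd.Series([500, 12000, 35000, 80000])
edges = [0, 10000, 20000, 30000, 40000]  # 5 edges define 4 bins

# One label per bin: the label list must be one shorter than the edge list.
named = pd.cut(s, bins=edges, labels=["A", "B", "C", "D"])
print(named.tolist())

# labels=False yields the bin index (0-based) rather than a label.
codes = pd.cut(s, bins=edges, labels=False)
print(codes.tolist())
```

In both cases the out-of-range value 80000 stays NaN, just as with the default interval labels.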
2. qcut(): equal-frequency discretization, where each interval aims to hold the same number of samples.
# Split into 8 equal-frequency intervals
>>> bs = pd.qcut(df['total_confirm'],8)
>>> bs[:5]
0         (380.5, 979.5]
1     (2720.75, 8321.25]
2    (8321.25, 677146.0]
3    (8321.25, 677146.0]
4       (979.5, 2720.75]
Name: total_confirm, dtype: category
Categories (8, interval[float64]): [(0.999, 12.0] < (12.0, 35.0] < (35.0, 122.375] < (122.375, 380.5] < (380.5, 979.5] < (979.5, 2720.75] < (2720.75, 8321.25] < (8321.25, 677146.0]]
# Check the number of samples in each interval
>>> bs.value_counts()
(0.999, 12.0]          28
(8321.25, 677146.0]    26
(979.5, 2720.75]       26
(2720.75, 8321.25]     25
(380.5, 979.5]         25
(122.375, 380.5]       25
(12.0, 35.0]           25
(35.0, 122.375]        24
Name: total_confirm, dtype: int64
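The counts show that the intervals do not contain exactly the same number of samples (they range from 24 to 28). "Equal frequency" therefore means roughly equal counts, not perfectly equal ones: the 204 samples here cannot be divided evenly into 8 groups, and values tied at a bin edge all land in the same interval.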
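The "roughly equal" behavior can be reproduced on a tiny example. The sketch below uses an invented Series of the integers 1 to 10 (an assumption, not the COVID-19 data) split into 4 quantile bins; since 10 is not divisible by 4, the bin counts cannot all match.

```python
import pandas as pd

# Ten invented values, 1 through 10.
s = pd.Series(range(1, 11))

# Four quantile-based bins; the edges fall at the 0%, 25%, 50%, 75%, 100% quantiles.
binned = pd.qcut(s, 4)

# Count samples per bin, listed in interval order rather than by frequency.
counts = binned.value_counts().sort_index()
print(counts)
```

The counts differ by at most one here; with real data, repeated values at a quantile boundary can widen the gap further.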