数据聚合与分组运算学习笔记3
4 透视表和交叉表
4.1 透视表
透视表(pivot table): 根据一个或多个键对数据进行聚合,并根据行和列上的分组键将数据分配到各个
矩形区域中。
- DataFrame表有一个pivot_table方法,还有一个顶级的pandas.pivot_table
- pivot_table还有一个分项小计选项,margins
In [32]: tips.pivot_table(index=['sex','smoker'])
Out[32]:
size tip tip_pct total_bill
sex smoker
Female No 2.592593 2.773519 0.156921 18.105185
Yes 2.242424 2.931515 0.182150 17.977879
Male No 2.711340 3.113402 0.160669 19.791237
Yes 2.500000 3.051167 0.152771 22.284500
In [33]: tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker
...: ')
Out[33]:
tip_pct size
smoker No Yes No Yes
sex day
Female Fri 0.165296 0.209129 2.500000 2.000000
Sat 0.147993 0.163817 2.307692 2.200000
Sun 0.165710 0.237075 3.071429 2.500000
Thur 0.155971 0.163073 2.480000 2.428571
Male Fri 0.138005 0.144730 2.000000 2.125000
Sat 0.162132 0.139067 2.656250 2.629630
Sun 0.158291 0.173964 2.883721 2.600000
Thur 0.165706 0.164417 2.500000 2.300000
# 增加分类汇总,margins=True
In [34]: tips.pivot_table(['tip_pct','size'],index=['sex','day'],columns='smoker
...: ',margins=True)
Out[34]:
tip_pct size
smoker No Yes All No Yes All
sex day
Female Fri 0.165296 0.209129 0.199388 2.500000 2.000000 2.111111
Sat 0.147993 0.163817 0.156470 2.307692 2.200000 2.250000
Sun 0.165710 0.237075 0.181569 3.071429 2.500000 2.944444
Thur 0.155971 0.163073 0.157525 2.480000 2.428571 2.468750
Male Fri 0.138005 0.144730 0.143385 2.000000 2.125000 2.100000
Sat 0.162132 0.139067 0.151577 2.656250 2.629630 2.644068
Sun 0.158291 0.173964 0.162344 2.883721 2.600000 2.810345
Thur 0.165706 0.164417 0.165276 2.500000 2.300000 2.433333
All 0.159328 0.163196 0.160803 2.668874 2.408602 2.569672
# 可以使用aggfunc传入其他聚合函数
In [37]: tips.pivot_table(['tip_pct','size'],index=['sex','smoker'],columns='day',aggfunc=len, margins=True)
Out[37]:
tip_pct size
day Fri Sat Sun Thur All Fri Sat Sun Thur All
sex smoker
Female No 2.0 13.0 14.0 25.0 54.0 2.0 13.0 14.0 25.0 54.0
Yes 7.0 15.0 4.0 7.0 33.0 7.0 15.0 4.0 7.0 33.0
Male No 2.0 32.0 43.0 20.0 97.0 2.0 32.0 43.0 20.0 97.0
Yes 8.0 27.0 15.0 10.0 60.0 8.0 27.0 15.0 10.0 60.0
All 19.0 87.0 76.0 62.0 244.0 19.0 87.0 76.0 62.0 244.0
pivot_table参数:
4.2 交叉表
交叉表示一种计算分组频率的特殊透视表,
In [39]: tips[:5]
Out[39]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female No Sun Dinner 2 0.059447
1 10.34 1.66 Male No Sun Dinner 3 0.160542
2 21.01 3.50 Male No Sun Dinner 3 0.166587
3 23.68 3.31 Male No Sun Dinner 2 0.139780
4 24.59 3.61 Female No Sun Dinner 4 0.146808
In [40]:
In [40]: pd.crosstab([tips.time, tips.day],tips.smoker, margins=True)
Out[40]:
smoker No Yes All
time day
Dinner Fri 3 9 12
Sat 45 42 87
Sun 57 19 76
Thur 1 0 1
Lunch Fri 1 6 7
Thur 44 17 61
All 151 93 244