Pandas选择与索引
Series和DataFrame对象与Numpy数组和标准Python字典的数据索引和选择模式一样。
字典形式选择和索引
Series
In [1]: import pandas as pd
In [2]: data = pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])
In [3]: data
Out[3]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
#用判断字典键的方法,判断某字段是否包含在Series的Index对象内
In [4]: 'a' in data
Out[4]: True
#用获取字典键的方法,获取Index对象
In [6]: data.keys()
Out[6]: Index(['a', 'b', 'c', 'd'], dtype='object')
#用获取字典键值对象的方法,获取Index/value映射对象
In [9]: list(data.items())
Out[9]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
#用添加字典键值对象的方法,添加Series数据
In [10]: data['e'] = 2
In [11]: data
Out[11]:
a 0.25
b 0.50
c 0.75
d 1.00
e 2.00
dtype: float64
DataFrame:相当于若干个Series对象构成的字典
In [13]: area = pd.Series({'California':43231,'Texas':53432,'New York':67632,
'Florida':89734,'Illinois':63321})
In [14]: pop = pd.Series({'California':453231,'Texas':533432,'New York':676732,
'Florida':897534,'Illinois':633241})
#以Series对象为字典对象进行创建DataFrame对象
In [15]: data = pd.DataFrame({'area':area,'pop':pop})
In [16]: data
Out[16]:
area pop
California 43231 453231
Texas 53432 533432
New York 67632 676732
Florida 89734 897534
Illinois 63321 633241
#获取某列(相当于获取字典某个键值)
In [17]: data['area']
Out[17]:
California 43231
Texas 53432
New York 67632
Florida 89734
Illinois 63321
Name: area, dtype: int64
#以属性形式获取某列
In [18]: data.area
Out[18]:
California 43231
Texas 53432
New York 67632
Florida 89734
Illinois 63321
Name: area, dtype: int64
#以字典形式赋值一个新列
In [19]: data['density'] = data['pop']/data['area']
In [20]: data
Out[20]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
数组形式选择和索引
Series相当于Numpy的一维数组
In [48]: data
Out[48]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
#以显式索引进行切片,包含最后一个索引对应的值
In [49]: data['a':'c']
Out[49]:
a 0.25
b 0.50
c 0.75
dtype: float64
#以隐式索引进行切片,不包含最后一个索引对应的值
In [50]: data[0:2]
Out[50]:
a 0.25
b 0.50
dtype: float64
#利用布尔索引进行取值
In [51]: data[(data > 0.3) & (data < 0.8)]
Out[51]:
b 0.50
c 0.75
dtype: float64
#利用花哨索引进行取值
In [52]: data[['a','d']]
Out[52]:
a 0.25
d 1.00
dtype: float64
DataFrame相当于Numpy的二维数组
In [22]: data
Out[22]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
#获取DataFrame的数组值,返回的是Numpy二维数组
In [23]: data.values
Out[23]:
array([[4.32310000e+04, 4.53231000e+05, 1.04839351e+01],
[5.34320000e+04, 5.33432000e+05, 9.98338075e+00],
[6.76320000e+04, 6.76732000e+05, 1.00060918e+01],
[8.97340000e+04, 8.97534000e+05, 1.00021619e+01],
[6.33210000e+04, 6.33241000e+05, 1.00004896e+01]])
#对DataFrame的数组进行装置,相当于Numpy二维数组的转置
In [24]: data.T
Out[24]:
California Texas New York Florida Illinois
area 43231.000000 53432.000000 67632.000000 89734.000000 63321.00000
pop 453231.000000 533432.000000 676732.000000 897534.000000 633241.00000
density 10.483935 9.983381 10.006092 10.002162 10.00049
#获得某一行的数组(方法同Numpy数组)
In [25]: data.values[0]
Out[25]: array([4.32310000e+04, 4.53231000e+05, 1.04839351e+01])
#布尔索引
In [27]: data[data > 10]
Out[27]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 NaN
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
In [28]: data[data['density']>10]
Out[28]:
area pop density
California 43231 453231 10.483935
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
索引选择器(iloc 和 loc)
为了避免在Index和Columns为整数时,造成切片和取值时的混乱,Panda提供了一些索引器(indexer)属性来进行取值。
iloc:隐私索引(顺序整数)
#Series
In [53]: data
Out[53]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In [54]: data.iloc[0:2]
Out[54]:
a 0.25
b 0.50
dtype: float64
In [55]: data.iloc[2]
Out[55]: 0.75
#DataFrame
In [22]: data
Out[22]:
area pop density
California 43231 453231 10.483935
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
In [31]: data.iloc[:3,:2]
Out[31]:
area pop
California 43231 453231
Texas 53432 533432
New York 67632 676732
In [32]: data.iloc[0,2] = 100
In [33]: data
Out[33]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
In [34]: data.iloc[3,2]
Out[34]: 10.002161945305012
In [35]: data.iloc[:3,2]
Out[35]:
California 100.000000
Texas 9.983381
New York 10.006092
Name: density, dtype: float64
In [36]: data.iloc[3,:2]
Out[36]:
area 89734.0
pop 897534.0
Name: Florida, dtype: float64
#切出的是DataFrame结构
In [37]: data.iloc[2:3,:2]
Out[37]:
area pop
New York 67632 676732
loc:显示索引(自定义)
#Series
In [53]: data
Out[53]:
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64
In [56]: data.loc['b']
Out[56]: 0.5
In [57]: data.loc['a':'b']
Out[57]:
a 0.25
b 0.50
dtype: float64
In [58]: data.loc[['a','c']]
Out[58]:
a 0.25
c 0.75
dtype: float64
#DataFrame
In [38]: data
Out[38]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
New York 67632 676732 10.006092
Florida 89734 897534 10.002162
Illinois 63321 633241 10.000490
#显性索引切片时与整数切片不同,最后一个切片位置也要包含在内
In [39]: data['California':'Texas']
Out[39]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
In [40]: data.loc['California':'Texas']
Out[40]:
area pop density
California 43231 453231 100.000000
Texas 53432 533432 9.983381
In [41]: data.loc[data['density']>10,['area','density']]
Out[41]:
area density
California 43231 100.000000
New York 67632 10.006092
Florida 89734 10.002162
Illinois 63321 10.000490
In [44]: data.loc[:,['pop']]
Out[44]:
pop
California 453231
Texas 533432
New York 676732
Florida 897534
Illinois 633241
多级索引应用(Hierarchical Indexing)
为解决高维数组表达的需求而引入多级索引(Hierarchical indexing)。Pandas的MultiIndex类型提供了丰富的操作手法。
多级索引创建
#直接创建,index写成多级模式,MultiIndex在后台自动创建
In [3]: data = pd.DataFrame(np.random.rand(4,2),
index = [['a','a','b','b'],[1,2,1,2]],
columns=['data1','data2'])
In [4]: data
Out[4]:
data1 data2
a 1 0.154166 0.773928
2 0.859236 0.777436
b 1 0.468589 0.010060
2 0.849230 0.929585
#使用元组字典进行创建,自动把元组元素分割转换成MultiIndex
In [5]: data = {('California',2000):3334323,('California',2010):33444332,
('Texas',2000):5555999,('Texas',2010):7778839}
In [6]: pd.Series(data)
Out[6]:
California 2000 3334323
2010 33444332
Texas 2000 5555999
2010 7778839
dtype: int64
#使用列表进行创建
In [59]: pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
Out[59]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],)
#使用元组进行创建
In [60]: pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
Out[60]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
#使用笛卡尔积进行创建
In [62]: pd.MultiIndex.from_product([['a','b'],[1,2]])
Out[62]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
#直接利用levels和labels(codes)进行创建
In [63]: pd.MultiIndex(levels=[['a','b'],[1,2]],labels=[[0,0,1,1],[0,1,0,1]])
D:\Anaconda\install\Scripts\ipython:1: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
Out[63]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
In [64]: pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
Out[64]:
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
多级索引取值与切片
Series
In [1]: import pandas as pd
In [2]: pop_index = pd.MultiIndex.from_product([['California','Texas','New York'],
['2000','2010']])
In [3]: data = [23343,33453,78893,34569,56493,89392]
In [4]: pop = pd.Series(data, index=pop_index, name='population')
In [5]: pop
Out[5]:
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
#添加索引名称
In [6]: pop.index.names = ['state','year']
In [7]: pop
Out[7]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
In [23]: pop
Out[23]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
#利用显性索引进行局部切片时,需要MultiIndex按顺序排列的(利用sort_index函数实现)
In [25]: pop
Out[25]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
In [26]: pop = pop.sort_index()
In [27]: pop
Out[27]:
state year
California 2000 23343
2010 33453
New York 2000 56493
2010 89392
Texas 2000 78893
2010 34569
Name: population, dtype: int64
#显性索引局部切片
In [28]: pop.loc['California':'New York']
Out[28]:
state year
California 2000 23343
2010 33453
New York 2000 56493
2010 89392
Name: population, dtype: int64
In [29]: pop.loc['California':'New York','2000']
Out[29]:
state year
California 2000 23343
New York 2000 56493
Name: population, dtype: int64
#显性+花哨索引取值
In [30]: pop.loc[['California','Texas']]
Out[30]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
Name: population, dtype: int64
#隐性索引局部切片
In [32]: pop.iloc[:2]
Out[32]:
state year
California 2000 23343
2010 33453
Name: population, dtype: int64
#布尔索引取值
In [33]: pop[pop > 30000]
Out[33]:
state year
California 2010 33453
New York 2000 56493
2010 89392
Texas 2000 78893
2010 34569
Name: population, dtype: int64
DataFrame
In [1]: import pandas as pd
In [4]: import numpy as np
#行标签
In [2]: index = pd.MultiIndex.from_product([[2019,2020],[1,2,3,4]],
names=['year','quarter'])
#列标签
In [3]: columns = pd.MultiIndex.from_product([['Bill','Andy','Coco'],
['Temp','HR']],names=['name','type'])
#获取模拟数据
In [12]: data = np.round(np.random.randn(8,6),1)
In [15]: data[:,::2] *10
Out[15]:
array([[ 8., -5., -10.],
[ 4., 22., 4.],
[ 14., -19., -18.],
[ 8., -1., -9.],
[ 2., -5., 4.],
[ 17., 7., 4.],
[ 0., 3., 2.],
[ 3., 11., -11.]])
In [17]: data += 37
In [18]: data
Out[18]:
array([[45. , 35.4, 32. , 38.1, 27. , 37.7],
[41. , 36.9, 59. , 35.4, 41. , 37.1],
[51. , 37.1, 18. , 37.5, 19. , 36.7],
[45. , 37.6, 36. , 37.6, 28. , 37.5],
[39. , 35.9, 32. , 37.2, 41. , 38. ],
[54. , 36.8, 44. , 38.7, 41. , 36.9],
[37. , 37. , 40. , 35.1, 39. , 37.9],
[40. , 37.5, 48. , 37.5, 26. , 35.7]])
#创建带行列标签的DataFrame
In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)
In [20]: health_data
Out[20]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year quarter
2019 1 45.0 35.4 32.0 38.1 27.0 37.7
2 41.0 36.9 59.0 35.4 41.0 37.1
3 51.0 37.1 18.0 37.5 19.0 36.7
4 45.0 37.6 36.0 37.6 28.0 37.5
2020 1 39.0 35.9 32.0 37.2 41.0 38.0
2 54.0 36.8 44.0 38.7 41.0 36.9
3 37.0 37.0 40.0 35.1 39.0 37.9
4 40.0 37.5 48.0 37.5 26.0 35.7
#直接根据标签名称索引获取值
In [21]: health_data['Bill']
Out[21]:
type Temp HR
year quarter
2019 1 45.0 35.4
2 41.0 36.9
3 51.0 37.1
4 45.0 37.6
2020 1 39.0 35.9
2 54.0 36.8
3 37.0 37.0
4 40.0 37.5
In [22]: health_data['Bill', 'HR']
Out[22]:
year quarter
2019 1 35.4
2 36.9
3 37.1
4 37.6
2020 1 35.9
2 36.8
3 37.0
4 37.5
Name: (Bill, HR), dtype: float64
#隐性索引获取值
In [24]: health_data.iloc[:2,:2]
Out[24]:
name Bill
type Temp HR
year quarter
2019 1 45.0 35.4
2 41.0 36.9
#显性索引获取值
In [23]: health_data.loc[:,('Bill','HR')]
Out[23]:
year quarter
2019 1 35.4
2 36.9
3 37.1
4 37.6
2020 1 35.9
2 36.8
3 37.0
4 37.5
Name: (Bill, HR), dtype: float64
#使用IndexSlice对象对显性索引进行分割
In [25]: idx = pd.IndexSlice
In [28]: health_data.loc[idx[:,1],idx[:,'HR']]
Out[28]:
name Bill Andy Coco
type HR HR HR
year quarter
2019 1 35.4 38.1 37.7
2020 1 35.9 37.2 38.0
In [29]: health_data.loc[idx[:,2],idx[:,'HR']]
Out[29]:
name Bill Andy Coco
type HR HR HR
year quarter
2019 2 36.9 35.4 37.1
2020 2 36.8 38.7 36.9
多级索引排序和转换
In [18]: frame = pd.DataFrame(np.arange(12).reshape((4,3)),
...: index=[list('aabb'),list('1212')],
...: columns = [['Ohio','Ohio','Colorado'],
...: ['Green','Red','Green']])
In [19]: frame
Out[19]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [20]: frame.index.names = ['key1','key2']
In [21]: frame.columns.names = ['state','color']
In [22]: frame
Out[22]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
#交换Index
In [26]: frame.swaplevel('key1','key2')
Out[26]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11
In [27]: frame
Out[27]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
In [28]: frame.sort_index(level=1)
Out[28]:
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11
In [29]: frame.swaplevel(0,1).sort_index(level=0)
Out[29]:
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11
多级索引行列转换和重置(stack、unstack、set_index、reset_index)
Series
In [6]: pop
Out[6]:
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
Name: population, dtype: int64
#通过level参数设置转换索引的层级
In [7]: pop.unstack(level=0)
Out[7]:
state California New York Texas
year
2000 23343 56493 78893
2010 33453 89392 34569
In [8]: pop.unstack(level=1)
Out[8]:
year 2000 2010
state
California 23343 33453
New York 56493 89392
Texas 78893 34569
In [9]: pop.unstack(level='year')
Out[9]:
year 2000 2010
state
California 23343 33453
New York 56493 89392
Texas 78893 34569
In [10]: pop.unstack(level='year').stack(level='year')
Out[10]:
state year
California 2000 23343
2010 33453
New York 2000 56493
2010 89392
Texas 2000 78893
2010 34569
dtype: int64
#reset_index()把层级索引变成列标签
In [11]: pop_flat = pop.reset_index(name='population')
In [12]: pop_flat
Out[12]:
state year population
0 California 2000 23343
1 California 2010 33453
2 Texas 2000 78893
3 Texas 2010 34569
4 New York 2000 56493
5 New York 2010 89392
#set_index()把列标签设置成层级索引
In [13]: pop_flat.set_index(['state','year'])
Out[13]:
population
state year
California 2000 23343
2010 33453
Texas 2000 78893
2010 34569
New York 2000 56493
2010 89392
DataFrame
In [30]: health_data
Out[30]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year quarter
2019 1 45.0 35.4 32.0 38.1 27.0 37.7
2 41.0 36.9 59.0 35.4 41.0 37.1
3 51.0 37.1 18.0 37.5 19.0 36.7
4 45.0 37.6 36.0 37.6 28.0 37.5
2020 1 39.0 35.9 32.0 37.2 41.0 38.0
2 54.0 36.8 44.0 38.7 41.0 36.9
3 37.0 37.0 40.0 35.1 39.0 37.9
4 40.0 37.5 48.0 37.5 26.0 35.7
#从列堆积到行:stack(level=*),*表示索引的序号,用隐性索引则自上而下从0开始索引,用显性索引可直接赋值索引名称
#默认level=1
In [31]: health_data.stack()
Out[31]:
name Andy Bill Coco
year quarter type
2019 1 HR 38.1 35.4 37.7
Temp 32.0 45.0 27.0
2 HR 35.4 36.9 37.1
Temp 59.0 41.0 41.0
3 HR 37.5 37.1 36.7
Temp 18.0 51.0 19.0
4 HR 37.6 37.6 37.5
Temp 36.0 45.0 28.0
2020 1 HR 37.2 35.9 38.0
Temp 32.0 39.0 41.0
2 HR 38.7 36.8 36.9
Temp 44.0 54.0 41.0
3 HR 35.1 37.0 37.9
Temp 40.0 37.0 39.0
4 HR 37.5 37.5 35.7
Temp 48.0 40.0 26.0
In [33]: health_data.stack(level=1)
Out[33]:
name Andy Bill Coco
year quarter type
2019 1 HR 38.1 35.4 37.7
Temp 32.0 45.0 27.0
2 HR 35.4 36.9 37.1
Temp 59.0 41.0 41.0
3 HR 37.5 37.1 36.7
Temp 18.0 51.0 19.0
4 HR 37.6 37.6 37.5
Temp 36.0 45.0 28.0
2020 1 HR 37.2 35.9 38.0
Temp 32.0 39.0 41.0
2 HR 38.7 36.8 36.9
Temp 44.0 54.0 41.0
3 HR 35.1 37.0 37.9
Temp 40.0 37.0 39.0
4 HR 37.5 37.5 35.7
Temp 48.0 40.0 26.0
In [32]: health_data.stack(level=0)
Out[32]:
type HR Temp
year quarter name
2019 1 Andy 38.1 32.0
Bill 35.4 45.0
Coco 37.7 27.0
2 Andy 35.4 59.0
Bill 36.9 41.0
Coco 37.1 41.0
3 Andy 37.5 18.0
Bill 37.1 51.0
Coco 36.7 19.0
4 Andy 37.6 36.0
Bill 37.6 45.0
Coco 37.5 28.0
2020 1 Andy 37.2 32.0
Bill 35.9 39.0
Coco 38.0 41.0
2 Andy 38.7 44.0
Bill 36.8 54.0
Coco 36.9 41.0
3 Andy 35.1 40.0
Bill 37.0 37.0
Coco 37.9 39.0
4 Andy 37.5 48.0
Bill 37.5 40.0
Coco 35.7 26.0
In [38]: health_data.stack(level='name')
Out[38]:
type HR Temp
year quarter name
2019 1 Andy 38.1 32.0
Bill 35.4 45.0
Coco 37.7 27.0
2 Andy 35.4 59.0
Bill 36.9 41.0
Coco 37.1 41.0
3 Andy 37.5 18.0
Bill 37.1 51.0
Coco 36.7 19.0
4 Andy 37.6 36.0
Bill 37.6 45.0
Coco 37.5 28.0
2020 1 Andy 37.2 32.0
Bill 35.9 39.0
Coco 38.0 41.0
2 Andy 38.7 44.0
Bill 36.8 54.0
Coco 36.9 41.0
3 Andy 35.1 40.0
Bill 37.0 37.0
Coco 37.9 39.0
4 Andy 37.5 48.0
Bill 37.5 40.0
Coco 35.7 26.0
#从行散开到列:unstack(level=*),*表示索引的序号,用隐性索引则自左而右从0开始索引,用显性索引可直接赋值索引名称
#默认level=1
In [35]: health_data.unstack(level=0)
Out[35]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020
quarter
1 45.0 39.0 35.4 35.9 32.0 32.0 38.1 37.2 27.0 41.0 37.7 38.0
2 41.0 54.0 36.9 36.8 59.0 44.0 35.4 38.7 41.0 41.0 37.1 36.9
3 51.0 37.0 37.1 37.0 18.0 40.0 37.5 35.1 19.0 39.0 36.7 37.9
4 45.0 40.0 37.6 37.5 36.0 48.0 37.6 37.5 28.0 26.0 37.5 35.7
In [36]: health_data.unstack(level=1)
Out[36]:
name Bill Andy ... Coco
type Temp HR Temp ... HR Temp HR
quarter 1 2 3 4 1 2 3 4 1 ... 4 1 2 3 4 1 2 3 4
year ...
2019 45.0 41.0 51.0 45.0 35.4 36.9 37.1 37.6 32.0 ... 37.6 27.0 41.0 19.0 28.0 37.7 37.1 36.7 37.5
2020 39.0 54.0 37.0 40.0 35.9 36.8 37.0 37.5 32.0 ... 37.5 41.0 41.0 39.0 26.0 38.0 36.9 37.9 35.7
[2 rows x 24 columns]
In [39]: health_data.unstack(level='year')
Out[39]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020 2019 2020
quarter
1 45.0 39.0 35.4 35.9 32.0 32.0 38.1 37.2 27.0 41.0 37.7 38.0
2 41.0 54.0 36.9 36.8 59.0 44.0 35.4 38.7 41.0 41.0 37.1 36.9
3 51.0 37.0 37.1 37.0 18.0 40.0 37.5 35.1 19.0 39.0 36.7 37.9
4 45.0 40.0 37.6 37.5 36.0 48.0 37.6 37.5 28.0 26.0 37.5 35.7
多级索引数据统计方法
In [23]: health_data
Out[23]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year quarter
2019 1 37.1 35.3 37.2 37.5 36.2 37.6
2 36.9 37.0 37.8 35.9 37.5 36.6
3 36.8 34.7 35.4 37.5 37.0 38.2
4 36.8 37.6 36.2 37.4 36.4 36.7
2020 1 37.5 34.8 36.8 36.5 36.3 37.4
2 36.2 36.7 37.3 37.2 36.3 36.4
3 37.7 38.1 36.9 37.7 36.6 36.1
4 36.5 36.3 37.6 36.3 35.5 37.3
#level参数设置累计的索引层级
In [24]: data_mean = health_data.mean(level='year')
In [25]: data_mean
Out[25]:
name Bill Andy Coco
type Temp HR Temp HR Temp HR
year
2019 36.900 36.150 36.65 37.075 36.775 37.275
2020 36.975 36.475 37.15 36.925 36.175 36.800
#axis参数设置累计的轴
In [26]: data_mean.mean(axis=1,level='type')
Out[26]:
type Temp HR
year
2019 36.775000 36.833333
2020 36.766667 36.733333