Pandas选择与索引

Series和DataFrame对象与Numpy数组和标准Python字典的数据索引和选择模式一样。

字典形式选择和索引

Series

In [1]: import pandas as pd
In [2]: data = pd.Series([0.25,0.5,0.75,1.0],index=['a','b','c','d'])

In [3]: data
Out[3]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

#用判断字典键的方法,判断某字段是否包含在Series的Index对象内
In [4]: 'a' in data
Out[4]: True

#用获取字典键的方法,获取Index对象
In [6]: data.keys()
Out[6]: Index(['a', 'b', 'c', 'd'], dtype='object')

#用获取字典键值对象的方法,获取Index/value映射对象
In [9]: list(data.items())
Out[9]: [('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

#用添加字典键值对象的方法,添加Series数据
In [10]: data['e'] = 2

In [11]: data
Out[11]:
a    0.25
b    0.50
c    0.75
d    1.00
e    2.00
dtype: float64

DataFrame:相当于若干个Series对象构成的字典

In [13]: area = pd.Series({'California':43231,'Texas':53432,'New York':67632,
                           'Florida':89734,'Illinois':63321})
In [14]: pop = pd.Series({'California':453231,'Texas':533432,'New York':676732,
                          'Florida':897534,'Illinois':633241})

#以Series对象为字典对象进行创建DataFrame对象
In [15]: data = pd.DataFrame({'area':area,'pop':pop})

In [16]: data
Out[16]:
             area     pop
California  43231  453231
Texas       53432  533432
New York    67632  676732
Florida     89734  897534
Illinois    63321  633241

#获取某列(相当于获取字典某个键值)
In [17]: data['area']
Out[17]:
California    43231
Texas         53432
New York      67632
Florida       89734
Illinois      63321
Name: area, dtype: int64

#以属性形式获取某列
In [18]: data.area
Out[18]:
California    43231
Texas         53432
New York      67632
Florida       89734
Illinois      63321
Name: area, dtype: int64

#以字典形式赋值一个新列
In [19]: data['density'] = data['pop']/data['area']

In [20]: data
Out[20]:
             area     pop    density
California  43231  453231  10.483935
Texas       53432  533432   9.983381
New York    67632  676732  10.006092
Florida     89734  897534  10.002162
Illinois    63321  633241  10.000490

数组形式选择和索引

Series相当于Numpy的一维数组

In [48]: data
Out[48]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

#以显式索引进行切片,包含最后一个索引对应的值
In [49]: data['a':'c']
Out[49]:
a    0.25
b    0.50
c    0.75
dtype: float64

#以隐式索引进行切片,不包含最后一个索引对应的值
In [50]: data[0:2]
Out[50]:
a    0.25
b    0.50
dtype: float64

#利用布尔索引进行取值
In [51]: data[(data > 0.3) & (data < 0.8)]
Out[51]:
b    0.50
c    0.75
dtype: float64

#利用花哨索引进行取值
In [52]: data[['a','d']]
Out[52]:
a    0.25
d    1.00
dtype: float64

DataFrame相当于Numpy的二维数组

In [22]: data
Out[22]:
             area     pop    density
California  43231  453231  10.483935
Texas       53432  533432   9.983381
New York    67632  676732  10.006092
Florida     89734  897534  10.002162
Illinois    63321  633241  10.000490

#获取DataFrame的数组值,返回的是Numpy二维数组
In [23]: data.values
Out[23]:
array([[4.32310000e+04, 4.53231000e+05, 1.04839351e+01],
       [5.34320000e+04, 5.33432000e+05, 9.98338075e+00],
       [6.76320000e+04, 6.76732000e+05, 1.00060918e+01],
       [8.97340000e+04, 8.97534000e+05, 1.00021619e+01],
       [6.33210000e+04, 6.33241000e+05, 1.00004896e+01]])

#对DataFrame的数组进行装置,相当于Numpy二维数组的转置
In [24]: data.T
Out[24]:
            California          Texas       New York        Florida      Illinois
area      43231.000000   53432.000000   67632.000000   89734.000000   63321.00000
pop      453231.000000  533432.000000  676732.000000  897534.000000  633241.00000
density      10.483935       9.983381      10.006092      10.002162      10.00049

#获得某一行的数组(方法同Numpy数组)
In [25]: data.values[0]
Out[25]: array([4.32310000e+04, 4.53231000e+05, 1.04839351e+01])
    
#布尔索引
In [27]: data[data > 10]
Out[27]:
             area     pop    density
California  43231  453231  10.483935
Texas       53432  533432        NaN
New York    67632  676732  10.006092
Florida     89734  897534  10.002162
Illinois    63321  633241  10.000490

In [28]: data[data['density']>10]
Out[28]:
             area     pop    density
California  43231  453231  10.483935
New York    67632  676732  10.006092
Florida     89734  897534  10.002162
Illinois    63321  633241  10.000490

索引选择器(iloc 和 loc)

为了避免在Index和Columns为整数时,造成切片和取值时的混乱,Panda提供了一些索引器(indexer)属性来进行取值。

iloc:隐私索引(顺序整数)
#Series
In [53]: data
Out[53]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [54]: data.iloc[0:2]
Out[54]:
a    0.25
b    0.50
dtype: float64

In [55]: data.iloc[2]
Out[55]: 0.75
    
#DataFrame
In [22]: data
Out[22]:
             area     pop    density
California  43231  453231  10.483935
Texas       53432  533432   9.983381
New York    67632  676732  10.006092
Florida     89734  897534  10.002162
Illinois    63321  633241  10.000490

In [31]: data.iloc[:3,:2]
Out[31]:
             area     pop
California  43231  453231
Texas       53432  533432
New York    67632  676732

In [32]: data.iloc[0,2] = 100

In [33]: data
Out[33]:
             area     pop     density
California  43231  453231  100.000000
Texas       53432  533432    9.983381
New York    67632  676732   10.006092
Florida     89734  897534   10.002162
Illinois    63321  633241   10.000490

In [34]: data.iloc[3,2]
Out[34]: 10.002161945305012

In [35]: data.iloc[:3,2]
Out[35]:
California    100.000000
Texas           9.983381
New York       10.006092
Name: density, dtype: float64

In [36]: data.iloc[3,:2]
Out[36]:
area     89734.0
pop     897534.0
Name: Florida, dtype: float64

#切出的是DataFrame结构
In [37]: data.iloc[2:3,:2]
Out[37]:
           area     pop
New York  67632  676732
loc:显示索引(自定义)
#Series
In [53]: data
Out[53]:
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [56]: data.loc['b']
Out[56]: 0.5

In [57]: data.loc['a':'b']
Out[57]:
a    0.25
b    0.50
dtype: float64

In [58]: data.loc[['a','c']]
Out[58]:
a    0.25
c    0.75
dtype: float64
    
#DataFrame
In [38]: data
Out[38]:
             area     pop     density
California  43231  453231  100.000000
Texas       53432  533432    9.983381
New York    67632  676732   10.006092
Florida     89734  897534   10.002162
Illinois    63321  633241   10.000490

#显性索引切片时与整数切片不同,最后一个切片位置也要包含在内
In [39]: data['California':'Texas']
Out[39]:
             area     pop     density
California  43231  453231  100.000000
Texas       53432  533432    9.983381

In [40]: data.loc['California':'Texas']
Out[40]:
             area     pop     density
California  43231  453231  100.000000
Texas       53432  533432    9.983381

In [41]: data.loc[data['density']>10,['area','density']]
Out[41]:
             area     density
California  43231  100.000000
New York    67632   10.006092
Florida     89734   10.002162
Illinois    63321   10.000490

In [44]: data.loc[:,['pop']]
Out[44]:
               pop
California  453231
Texas       533432
New York    676732
Florida     897534
Illinois    633241

多级索引应用(Hierarchical Indexing)

为解决高维数组表达的需求而引入多级索引(Hierarchical indexing)。Pandas的MultiIndex类型提供了丰富的操作手法。

多级索引创建
#直接创建,index写成多级模式,MultiIndex在后台自动创建
In [3]: data = pd.DataFrame(np.random.rand(4,2),
                            index = [['a','a','b','b'],[1,2,1,2]],
                            columns=['data1','data2'])

In [4]: data
Out[4]:
        data1     data2
a 1  0.154166  0.773928
  2  0.859236  0.777436
b 1  0.468589  0.010060
  2  0.849230  0.929585

#使用元组字典进行创建,自动把元组元素分割转换成MultiIndex
In [5]: data = {('California',2000):3334323,('California',2010):33444332,
                ('Texas',2000):5555999,('Texas',2010):7778839}

In [6]: pd.Series(data)
Out[6]:
California  2000     3334323
            2010    33444332
Texas       2000     5555999
            2010     7778839
dtype: int64

#使用列表进行创建
In [59]: pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
Out[59]:
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],)

#使用元组进行创建
In [60]: pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
Out[60]:
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

#使用笛卡尔积进行创建
In [62]: pd.MultiIndex.from_product([['a','b'],[1,2]])
Out[62]:
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

#直接利用levels和labels(codes)进行创建
In [63]: pd.MultiIndex(levels=[['a','b'],[1,2]],labels=[[0,0,1,1],[0,1,0,1]])
D:\Anaconda\install\Scripts\ipython:1: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
Out[63]:
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [64]: pd.MultiIndex(levels=[['a','b'],[1,2]],codes=[[0,0,1,1],[0,1,0,1]])
Out[64]:
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
多级索引取值与切片
Series
In [1]: import pandas as pd
In [2]: pop_index = pd.MultiIndex.from_product([['California','Texas','New York'],
                                                ['2000','2010']])
In [3]: data = [23343,33453,78893,34569,56493,89392]
In [4]: pop = pd.Series(data, index=pop_index, name='population')
In [5]: pop
Out[5]:
California  2000    23343
            2010    33453
Texas       2000    78893
            2010    34569
New York    2000    56493
            2010    89392
Name: population, dtype: int64

#添加索引名称        
In [6]: pop.index.names = ['state','year']
In [7]: pop
Out[7]:
state       year
California  2000    23343
            2010    33453
Texas       2000    78893
            2010    34569
New York    2000    56493
            2010    89392
Name: population, dtype: int64
In [23]: pop
Out[23]:
state       year
California  2000    23343
            2010    33453
Texas       2000    78893
            2010    34569
New York    2000    56493
            2010    89392
Name: population, dtype: int64

#利用显性索引进行局部切片时,需要MultiIndex按顺序排列的(利用sort_index函数实现)
In [25]: pop
Out[25]:
state       year
California  2000    23343
            2010    33453
Texas       2000    78893
            2010    34569
New York    2000    56493
            2010    89392
Name: population, dtype: int64

In [26]: pop = pop.sort_index()
In [27]: pop
Out[27]:
state       year
California  2000    23343
            2010    33453
New York    2000    56493
            2010    89392
Texas       2000    78893
            2010    34569
Name: population, dtype: int64

#显性索引局部切片
In [28]: pop.loc['California':'New York']
Out[28]:
state       year
California  2000    23343
            2010    33453
New York    2000    56493
            2010    89392
Name: population, dtype: int64

In [29]: pop.loc['California':'New York','2000']
Out[29]:
state       year
California  2000    23343
New York    2000    56493
Name: population, dtype: int64

#显性+花哨索引取值
In [30]: pop.loc[['California','Texas']]
Out[30]:
state       year
California  2000    23343
            2010    33453
Texas       2000    78893
            2010    34569
Name: population, dtype: int64

#隐性索引局部切片
In [32]: pop.iloc[:2]
Out[32]:
state       year
California  2000    23343
            2010    33453
Name: population, dtype: int64

#布尔索引取值
In [33]: pop[pop > 30000]
Out[33]:
state       year
California  2010    33453
New York    2000    56493
            2010    89392
Texas       2000    78893
            2010    34569
Name: population, dtype: int64
DataFrame
In [1]: import pandas as pd
In [4]: import numpy as np

#行标签
In [2]: index = pd.MultiIndex.from_product([[2019,2020],[1,2,3,4]],
                                           names=['year','quarter'])

#列标签
In [3]: columns = pd.MultiIndex.from_product([['Bill','Andy','Coco'],
                                              ['Temp','HR']],names=['name','type'])

#获取模拟数据
In [12]: data = np.round(np.random.randn(8,6),1)

In [15]: data[:,::2] *10
Out[15]:
array([[  8.,  -5., -10.],
       [  4.,  22.,   4.],
       [ 14., -19., -18.],
       [  8.,  -1.,  -9.],
       [  2.,  -5.,   4.],
       [ 17.,   7.,   4.],
       [  0.,   3.,   2.],
       [  3.,  11., -11.]])

In [17]: data += 37
In [18]: data
Out[18]:
array([[45. , 35.4, 32. , 38.1, 27. , 37.7],
       [41. , 36.9, 59. , 35.4, 41. , 37.1],
       [51. , 37.1, 18. , 37.5, 19. , 36.7],
       [45. , 37.6, 36. , 37.6, 28. , 37.5],
       [39. , 35.9, 32. , 37.2, 41. , 38. ],
       [54. , 36.8, 44. , 38.7, 41. , 36.9],
       [37. , 37. , 40. , 35.1, 39. , 37.9],
       [40. , 37.5, 48. , 37.5, 26. , 35.7]])

#创建带行列标签的DataFrame
In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)
In [20]: health_data
Out[20]:
name          Bill        Andy        Coco
type          Temp    HR  Temp    HR  Temp    HR
year quarter
2019 1        45.0  35.4  32.0  38.1  27.0  37.7
     2        41.0  36.9  59.0  35.4  41.0  37.1
     3        51.0  37.1  18.0  37.5  19.0  36.7
     4        45.0  37.6  36.0  37.6  28.0  37.5
2020 1        39.0  35.9  32.0  37.2  41.0  38.0
     2        54.0  36.8  44.0  38.7  41.0  36.9
     3        37.0  37.0  40.0  35.1  39.0  37.9
     4        40.0  37.5  48.0  37.5  26.0  35.7

#直接根据标签名称索引获取值
In [21]: health_data['Bill']
Out[21]:
type          Temp    HR
year quarter
2019 1        45.0  35.4
     2        41.0  36.9
     3        51.0  37.1
     4        45.0  37.6
2020 1        39.0  35.9
     2        54.0  36.8
     3        37.0  37.0
     4        40.0  37.5

In [22]: health_data['Bill', 'HR']
Out[22]:
year  quarter
2019  1          35.4
      2          36.9
      3          37.1
      4          37.6
2020  1          35.9
      2          36.8
      3          37.0
      4          37.5
Name: (Bill, HR), dtype: float64

#隐性索引获取值
In [24]: health_data.iloc[:2,:2]
Out[24]:
name          Bill
type          Temp    HR
year quarter
2019 1        45.0  35.4
     2        41.0  36.9

#显性索引获取值
In [23]: health_data.loc[:,('Bill','HR')]
Out[23]:
year  quarter
2019  1          35.4
      2          36.9
      3          37.1
      4          37.6
2020  1          35.9
      2          36.8
      3          37.0
      4          37.5
Name: (Bill, HR), dtype: float64
        
#使用IndexSlice对象对显性索引进行分割       
In [25]: idx = pd.IndexSlice
In [28]: health_data.loc[idx[:,1],idx[:,'HR']]
Out[28]:
name          Bill  Andy  Coco
type            HR    HR    HR
year quarter
2019 1        35.4  38.1  37.7
2020 1        35.9  37.2  38.0

In [29]: health_data.loc[idx[:,2],idx[:,'HR']]
Out[29]:
name          Bill  Andy  Coco
type            HR    HR    HR
year quarter
2019 2        36.9  35.4  37.1
2020 2        36.8  38.7  36.9
多级索引排序和转换
In [18]: frame = pd.DataFrame(np.arange(12).reshape((4,3)),
    ...:                     index=[list('aabb'),list('1212')],
    ...:                     columns = [['Ohio','Ohio','Colorado'],
    ...:                                 ['Green','Red','Green']])

In [19]: frame
Out[19]:
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

In [20]: frame.index.names = ['key1','key2']

In [21]: frame.columns.names = ['state','color']

In [22]: frame
Out[22]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

#交换Index
In [26]: frame.swaplevel('key1','key2')
Out[26]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

In [27]: frame
Out[27]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

In [28]: frame.sort_index(level=1)
Out[28]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [29]: frame.swaplevel(0,1).sort_index(level=0)
Out[29]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
多级索引行列转换和重置(stack、unstack、set_index、reset_index)
Series
In [6]: pop
Out[6]:
state       year
California  2000    23343
            2010    33453
Texas       2000    78893
            2010    34569
New York    2000    56493
            2010    89392
Name: population, dtype: int64

#通过level参数设置转换索引的层级
In [7]: pop.unstack(level=0)
Out[7]:
state  California  New York  Texas
year
2000        23343     56493  78893
2010        33453     89392  34569

In [8]: pop.unstack(level=1)
Out[8]:
year         2000   2010
state
California  23343  33453
New York    56493  89392
Texas       78893  34569

In [9]: pop.unstack(level='year')
Out[9]:
year         2000   2010
state
California  23343  33453
New York    56493  89392
Texas       78893  34569

In [10]: pop.unstack(level='year').stack(level='year')
Out[10]:
state       year
California  2000    23343
            2010    33453
New York    2000    56493
            2010    89392
Texas       2000    78893
            2010    34569
dtype: int64

#reset_index()把层级索引变成列标签
In [11]: pop_flat = pop.reset_index(name='population')

In [12]: pop_flat
Out[12]:
        state  year  population
0  California  2000       23343
1  California  2010       33453
2       Texas  2000       78893
3       Texas  2010       34569
4    New York  2000       56493
5    New York  2010       89392

#set_index()把列标签设置成层级索引
In [13]: pop_flat.set_index(['state','year'])
Out[13]:
                 population
state      year
California 2000       23343
           2010       33453
Texas      2000       78893
           2010       34569
New York   2000       56493
           2010       89392
DataFrame
In [30]: health_data
Out[30]:
name          Bill        Andy        Coco
type          Temp    HR  Temp    HR  Temp    HR
year quarter
2019 1        45.0  35.4  32.0  38.1  27.0  37.7
     2        41.0  36.9  59.0  35.4  41.0  37.1
     3        51.0  37.1  18.0  37.5  19.0  36.7
     4        45.0  37.6  36.0  37.6  28.0  37.5
2020 1        39.0  35.9  32.0  37.2  41.0  38.0
     2        54.0  36.8  44.0  38.7  41.0  36.9
     3        37.0  37.0  40.0  35.1  39.0  37.9
     4        40.0  37.5  48.0  37.5  26.0  35.7

#从列堆积到行:stack(level=*),*表示索引的序号,用隐性索引则自上而下从0开始索引,用显性索引可直接赋值索引名称
#默认level=1
In [31]: health_data.stack()
Out[31]:
name               Andy  Bill  Coco
year quarter type
2019 1       HR    38.1  35.4  37.7
             Temp  32.0  45.0  27.0
     2       HR    35.4  36.9  37.1
             Temp  59.0  41.0  41.0
     3       HR    37.5  37.1  36.7
             Temp  18.0  51.0  19.0
     4       HR    37.6  37.6  37.5
             Temp  36.0  45.0  28.0
2020 1       HR    37.2  35.9  38.0
             Temp  32.0  39.0  41.0
     2       HR    38.7  36.8  36.9
             Temp  44.0  54.0  41.0
     3       HR    35.1  37.0  37.9
             Temp  40.0  37.0  39.0
     4       HR    37.5  37.5  35.7
             Temp  48.0  40.0  26.0

In [33]: health_data.stack(level=1)
Out[33]:
name               Andy  Bill  Coco
year quarter type
2019 1       HR    38.1  35.4  37.7
             Temp  32.0  45.0  27.0
     2       HR    35.4  36.9  37.1
             Temp  59.0  41.0  41.0
     3       HR    37.5  37.1  36.7
             Temp  18.0  51.0  19.0
     4       HR    37.6  37.6  37.5
             Temp  36.0  45.0  28.0
2020 1       HR    37.2  35.9  38.0
             Temp  32.0  39.0  41.0
     2       HR    38.7  36.8  36.9
             Temp  44.0  54.0  41.0
     3       HR    35.1  37.0  37.9
             Temp  40.0  37.0  39.0
     4       HR    37.5  37.5  35.7
             Temp  48.0  40.0  26.0
        
In [32]: health_data.stack(level=0)
Out[32]:
type                 HR  Temp
year quarter name
2019 1       Andy  38.1  32.0
             Bill  35.4  45.0
             Coco  37.7  27.0
     2       Andy  35.4  59.0
             Bill  36.9  41.0
             Coco  37.1  41.0
     3       Andy  37.5  18.0
             Bill  37.1  51.0
             Coco  36.7  19.0
     4       Andy  37.6  36.0
             Bill  37.6  45.0
             Coco  37.5  28.0
2020 1       Andy  37.2  32.0
             Bill  35.9  39.0
             Coco  38.0  41.0
     2       Andy  38.7  44.0
             Bill  36.8  54.0
             Coco  36.9  41.0
     3       Andy  35.1  40.0
             Bill  37.0  37.0
             Coco  37.9  39.0
     4       Andy  37.5  48.0
             Bill  37.5  40.0
             Coco  35.7  26.0
            
In [38]: health_data.stack(level='name')
Out[38]:
type                 HR  Temp
year quarter name
2019 1       Andy  38.1  32.0
             Bill  35.4  45.0
             Coco  37.7  27.0
     2       Andy  35.4  59.0
             Bill  36.9  41.0
             Coco  37.1  41.0
     3       Andy  37.5  18.0
             Bill  37.1  51.0
             Coco  36.7  19.0
     4       Andy  37.6  36.0
             Bill  37.6  45.0
             Coco  37.5  28.0
2020 1       Andy  37.2  32.0
             Bill  35.9  39.0
             Coco  38.0  41.0
     2       Andy  38.7  44.0
             Bill  36.8  54.0
             Coco  36.9  41.0
     3       Andy  35.1  40.0
             Bill  37.0  37.0
             Coco  37.9  39.0
     4       Andy  37.5  48.0
             Bill  37.5  40.0
             Coco  35.7  26.0
            
            
#从行散开到列:unstack(level=*),*表示索引的序号,用隐性索引则自左而右从0开始索引,用显性索引可直接赋值索引名称
#默认level=1

In [35]: health_data.unstack(level=0)
Out[35]:
name     Bill                    Andy                    Coco
type     Temp          HR        Temp          HR        Temp          HR
year     2019  2020  2019  2020  2019  2020  2019  2020  2019  2020  2019  2020
quarter
1        45.0  39.0  35.4  35.9  32.0  32.0  38.1  37.2  27.0  41.0  37.7  38.0
2        41.0  54.0  36.9  36.8  59.0  44.0  35.4  38.7  41.0  41.0  37.1  36.9
3        51.0  37.0  37.1  37.0  18.0  40.0  37.5  35.1  19.0  39.0  36.7  37.9
4        45.0  40.0  37.6  37.5  36.0  48.0  37.6  37.5  28.0  26.0  37.5  35.7

In [36]: health_data.unstack(level=1)
Out[36]:
name     Bill                                            Andy  ...        Coco
type     Temp                      HR                    Temp  ...    HR  Temp                      HR
quarter     1     2     3     4     1     2     3     4     1  ...     4     1     2     3     4     1     2     3     4
year                                                           ...
2019     45.0  41.0  51.0  45.0  35.4  36.9  37.1  37.6  32.0  ...  37.6  27.0  41.0  19.0  28.0  37.7  37.1  36.7  37.5
2020     39.0  54.0  37.0  40.0  35.9  36.8  37.0  37.5  32.0  ...  37.5  41.0  41.0  39.0  26.0  38.0  36.9  37.9  35.7

[2 rows x 24 columns]

In [39]: health_data.unstack(level='year')
Out[39]:
name     Bill                    Andy                    Coco
type     Temp          HR        Temp          HR        Temp          HR
year     2019  2020  2019  2020  2019  2020  2019  2020  2019  2020  2019  2020
quarter
1        45.0  39.0  35.4  35.9  32.0  32.0  38.1  37.2  27.0  41.0  37.7  38.0
2        41.0  54.0  36.9  36.8  59.0  44.0  35.4  38.7  41.0  41.0  37.1  36.9
3        51.0  37.0  37.1  37.0  18.0  40.0  37.5  35.1  19.0  39.0  36.7  37.9
4        45.0  40.0  37.6  37.5  36.0  48.0  37.6  37.5  28.0  26.0  37.5  35.7
多级索引数据统计方法
In [23]: health_data
Out[23]:
name          Bill        Andy        Coco
type          Temp    HR  Temp    HR  Temp    HR
year quarter
2019 1        37.1  35.3  37.2  37.5  36.2  37.6
     2        36.9  37.0  37.8  35.9  37.5  36.6
     3        36.8  34.7  35.4  37.5  37.0  38.2
     4        36.8  37.6  36.2  37.4  36.4  36.7
2020 1        37.5  34.8  36.8  36.5  36.3  37.4
     2        36.2  36.7  37.3  37.2  36.3  36.4
     3        37.7  38.1  36.9  37.7  36.6  36.1
     4        36.5  36.3  37.6  36.3  35.5  37.3

#level参数设置累计的索引层级
In [24]: data_mean = health_data.mean(level='year')
In [25]: data_mean
Out[25]:
name    Bill           Andy            Coco
type    Temp      HR   Temp      HR    Temp      HR
year
2019  36.900  36.150  36.65  37.075  36.775  37.275
2020  36.975  36.475  37.15  36.925  36.175  36.800

#axis参数设置累计的轴
In [26]: data_mean.mean(axis=1,level='type')
Out[26]:
type       Temp         HR
year
2019  36.775000  36.833333
2020  36.766667  36.733333
posted @ 2022-01-24 15:53  溪奇的数据  阅读(47)  评论(0编辑  收藏  举报