Getting to know pandas in 30 minutes

Reference: https://pandas.pydata.org/docs/user_guide/10min.html
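
Note: none of the IPython sessions below show the imports, but every example assumes the standard aliases used throughout the pandas documentation:

import numpy as np
import pandas as pd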

 

Creating objects

Create a Series object:

In [150]: s = pd.Series([1, 3, 5, np.nan, 6, 8])                                                            

In [151]: s                                                                                                 
Out[151]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

  

Create a DataFrame from a NumPy array, with a datetime index and labeled columns:

In [152]: dates = pd.date_range('20130101', periods=6)                                                      

In [153]: dates                                                                                             
Out[153]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [154]: df = pd.DataFrame(np.random.randn(6,4),index=dates, columns=list('ABCD'))                         

In [155]: df                                                                                                
Out[155]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954  

 

Create a DataFrame from a dict of objects:

In [179]: df2 = pd.DataFrame(
     ...:     {
     ...:         "A": 1.0,
     ...:         "B": pd.Timestamp("20130102"),
     ...:         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
     ...:         "D": np.array([3] * 4, dtype="int32"),
     ...:         "E": pd.Categorical(["test", "train", "test", "train"]),
     ...:         "F": "foo",
     ...:     }
     ...: )

In [180]: df2                                                                                               
Out[180]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

In [181]: df2.dtypes                                                                                        
Out[181]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


  The dtypes attribute shows the data type of each column.

 

Viewing data

head and tail show the first and last rows; index and columns expose the row and column labels:

In [183]: df.head()                                                                                         
Out[183]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109

In [184]: df.tail(2)                                                                                        
Out[184]: 
                   A         B         C         D
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954

In [185]: df.head(1)                                                                                        
Out[185]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168

In [186]: df.index                                                                                          
Out[186]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [187]: df.columns                                                                                        
Out[187]: Index(['A', 'B', 'C', 'D'], dtype='object')


  DataFrame.to_numpy() gives you a NumPy array of the underlying data. Note that this can be expensive: pandas stores one dtype per column, while a NumPy array has a single dtype for the whole array, so a mixed-dtype frame has to be copied and upcast (often to object).

In [194]: df.to_numpy()                                                                                     
Out[194]: 
array([[ 0.9119509 ,  0.11907694,  2.24459767, -1.52416844],
       [-0.71159066,  1.81432742,  0.85962346, -0.24911614],
       [-0.0414173 , -1.15847237,  1.03775241,  1.12435552],
       [ 1.22224697, -0.65168145,  1.76462966,  0.50050719],
       [-0.33219183, -1.42487132, -0.68025439,  0.37010889],
       [-0.72452176, -1.23899146,  0.22342519, -1.77595409]])

In [195]: df2.to_numpy()                                                                                    
Out[195]: 
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

  The two calls above illustrate the difference: the second one, on the mixed-dtype df2, is far more expensive because every value has to be boxed into a Python object.

describe() shows a quick statistical summary of the data: count, mean, standard deviation, min, max, quartiles, and so on.

In [202]: df.describe()                                                                                     
Out[202]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.054079 -0.423435  0.908296 -0.259045
std    0.830825  1.229820  1.051730  1.165178
min   -0.724522 -1.424871 -0.680254 -1.775954
25%   -0.616741 -1.218862  0.382475 -1.205405
50%   -0.186805 -0.905077  0.948688  0.060496
75%    0.673609 -0.073613  1.582910  0.467908
max    1.222247  1.814327  2.244598  1.124356

  T transposes the data, swapping the row and column indexes.

In [204]: df.T                                                                                              
Out[204]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.911951   -0.711591   -0.041417    1.222247   -0.332192   -0.724522
B    0.119077    1.814327   -1.158472   -0.651681   -1.424871   -1.238991
C    2.244598    0.859623    1.037752    1.764630   -0.680254    0.223425
D   -1.524168   -0.249116    1.124356    0.500507    0.370109   -1.775954

Sorting by index

sort_index sorts by an axis's labels; ascending=False reverses the order, and axis=1 sorts the columns instead of the rows:

In [207]: df.sort_index(ascending=False)                                                                    
Out[207]: 
                   A         B         C         D
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-01  0.911951  0.119077  2.244598 -1.524168

In [208]: df.sort_index?                                                                                    

In [209]: df.sort_index(axis=1, ascending=False)                                                            
Out[209]: 
                   D         C         B         A
2013-01-01 -1.524168  2.244598  0.119077  0.911951
2013-01-02 -0.249116  0.859623  1.814327 -0.711591
2013-01-03  1.124356  1.037752 -1.158472 -0.041417
2013-01-04  0.500507  1.764630 -0.651681  1.222247
2013-01-05  0.370109 -0.680254 -1.424871 -0.332192
2013-01-06 -1.775954  0.223425 -1.238991 -0.724522

  Finally, you can sort by the values of a specific column with sort_values:

In [212]: df.sort_values(by='B')                                                                            
Out[212]: 
                   A         B         C         D
2013-01-05 -0.332192 -1.424871 -0.680254  0.370109
2013-01-06 -0.724522 -1.238991  0.223425 -1.775954
2013-01-03 -0.041417 -1.158472  1.037752  1.124356
2013-01-04  1.222247 -0.651681  1.764630  0.500507
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116

  

Selection

For production code, pandas recommends the optimized access methods .at, .iat, .loc and .iloc.

Selecting a single column (a single label inside []) returns a Series:

In [226]: df['A']                                                                                           
Out[226]: 
2013-01-01    0.911951
2013-01-02   -0.711591
2013-01-03   -0.041417
2013-01-04    1.222247
2013-01-05   -0.332192
2013-01-06   -0.724522
Freq: D, Name: A, dtype: float64

  Slicing with [] selects rows and returns a DataFrame. An integer slice such as df[0:3] follows Python conventions and excludes the endpoint, while a label slice such as df["20130102":"20130103"] includes both endpoints.

In [229]: df[0:3]                                                                                           
Out[229]: 
                   A         B         C         D
2013-01-01  0.911951  0.119077  2.244598 -1.524168
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [230]: df["20130102":'20130103']                                                                         
Out[230]: 
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

  

Selection by label (some books call this explicit indexing) uses .loc. Note that .loc is not called with parentheses; you index it directly with [].

 

In [238]: df.loc[dates[0]]                                                                                  
Out[238]: 
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64

  

Selecting several columns for all rows:

In [239]: df.loc[:,['A','B']]                                                                               
Out[239]: 
                   A         B
2013-01-01  0.911951  0.119077
2013-01-02 -0.711591  1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04  1.222247 -0.651681
2013-01-05 -0.332192 -1.424871
2013-01-06 -0.724522 -1.238991

 Label slicing over the rows combined with a list of columns:

In [240]: df.loc["20130102":"20130104",["A","B"]]                                                           
Out[240]: 
                   A         B
2013-01-02 -0.711591  1.814327
2013-01-03 -0.041417 -1.158472
2013-01-04  1.222247 -0.651681

  Selecting a single row with specific columns returns a Series:

In [241]: df.loc["20130102",["A","B"]]                                                                      
Out[241]: 
A   -0.711591
B    1.814327
Name: 2013-01-02 00:00:00, dtype: float64

 Finally, two ways to get a single value (a scalar); .at is the fast, scalar-only equivalent of .loc:

In [242]: df.loc[dates[0],"A"]                                                                              
Out[242]: 0.91195089904327

In [243]: df.at[dates[0],"A"]                                                                               
Out[243]: 0.91195089904327


  

Selection by position (implicit indexing) uses .iloc.

It works much like .loc, accepting a single integer, lists of integers, and slices.

A single integer selects one row and returns a Series:

In [244]: df.iloc[0]                                                                                        
Out[244]: 
A    0.911951
B    0.119077
C    2.244598
D   -1.524168
Name: 2013-01-01 00:00:00, dtype: float64

In [245]: df.iloc[3]                                                                                        
Out[245]: 
A    1.222247
B   -0.651681
C    1.764630
D    0.500507
Name: 2013-01-04 00:00:00, dtype: float64

  Integer slices work exactly like in Python: the end position is excluded.

In [246]: df.iloc[3:5,0:2]                                                                                  
Out[246]: 
                   A         B
2013-01-04  1.222247 -0.651681
2013-01-05 -0.332192 -1.424871

  You can also pass lists of integer positions, with the row and column selectors separated by a comma:

In [247]: df.iloc[[1,2,4],[0,2]]                                                                            
Out[247]: 
                   A         C
2013-01-02 -0.711591  0.859623
2013-01-03 -0.041417  1.037752
2013-01-05 -0.332192 -0.680254

  A bare : selects an entire axis; omitting the column selector has the same effect:

In [248]: df.iloc[1:3]                                                                                      
Out[248]: 
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [249]: df.iloc[1:3,:]                                                                                    
Out[249]: 
                   A         B         C         D
2013-01-02 -0.711591  1.814327  0.859623 -0.249116
2013-01-03 -0.041417 -1.158472  1.037752  1.124356

In [250]: df.iloc[:,1:3]                                                                                    
Out[250]: 
                   B         C
2013-01-01  0.119077  2.244598
2013-01-02  1.814327  0.859623
2013-01-03 -1.158472  1.037752
2013-01-04 -0.651681  1.764630
2013-01-05 -1.424871 -0.680254
2013-01-06 -1.238991  0.223425

  Passing two integer positions returns a scalar; .iat is the fast positional equivalent:

In [251]: df.iloc[1,1]                                                                                      
Out[251]: 1.8143274155708045

In [252]: df.iat[1,1]                                                                                       
Out[252]: 1.8143274155708045

  

Boolean indexing (some books call this masking)

 Using a single column's values to select rows:

In [8]: df[df['A'] > 0]                                                                                                                                                
Out[8]: 
                   A         B         C         D
2013-01-02  0.314732  1.187635 -0.290305 -1.091247
2013-01-03  0.535155  0.453357  0.308055 -0.533740
2013-01-04  0.688215 -1.474558 -2.561580 -1.211188
2013-01-06  1.104539 -1.272097  1.423444  0.417609

  Selecting values from the whole DataFrame where a boolean condition is met; cells that fail the condition become NaN:

In [9]: df[df>0]                                                                                                                                                       
Out[9]: 
                   A         B         C         D
2013-01-01       NaN  0.954575       NaN  0.070635
2013-01-02  0.314732  1.187635       NaN       NaN
2013-01-03  0.535155  0.453357  0.308055       NaN
2013-01-04  0.688215       NaN       NaN       NaN
2013-01-05       NaN       NaN  1.522803       NaN
2013-01-06  1.104539       NaN  1.423444  0.417609

  Using isin() for filtering:
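
The df2 queried here (which has an extra E column) is not the df2 built earlier; its construction is skipped in the transcript. In the upstream 10-minutes guide it is created roughly like this (a sketch, not part of the original session):

df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]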

In [14]: df2                                                                                                                                                           
Out[14]: 
                   A         B         C         D      E
2013-01-01 -1.629264  0.954575 -0.102982  0.070635    one
2013-01-02  0.314732  1.187635 -0.290305 -1.091247    one
2013-01-03  0.535155  0.453357  0.308055 -0.533740    two
2013-01-04  0.688215 -1.474558 -2.561580 -1.211188  three
2013-01-05 -1.579950 -0.317184  1.522803 -0.702557   four
2013-01-06  1.104539 -1.272097  1.423444  0.417609  three

In [15]: df2[df2['E'].isin(['two','four'])]                                                                                                                            
Out[15]: 
                   A         B         C         D     E
2013-01-03  0.535155  0.453357  0.308055 -0.533740   two
2013-01-05 -1.579950 -0.317184  1.522803 -0.702557  four

Setting

Setting a new column automatically aligns the data by its index; dates not present in the Series become NaN.

In [16]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6))                                                                                       

In [17]: s1                                                                                                                                                            
Out[17]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [18]: df['F'] = s1  

  Setting values by label (.at) and by position (.iat):

In [19]: df.at[dates[0],"A"] = 0                                                                                                                                       

In [20]: df.iat[0,1] = 0                                                                                                                                               

  Assigning a column from a NumPy array:

In [26]: df.loc[:,"D"] = np.array([5] * len(df))                                                                                                                       

In [27]: df                                                                                                                                                            
Out[27]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.102982  5  NaN
2013-01-02  0.314732  1.187635 -0.290305  5  1.0
2013-01-03  0.535155  0.453357  0.308055  5  2.0
2013-01-04  0.688215 -1.474558 -2.561580  5  3.0
2013-01-05 -1.579950 -0.317184  1.522803  5  4.0
2013-01-06  1.104539 -1.272097  1.423444  5  5.0

  A where operation with setting: negate only the selected (positive) values.
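
The transcript does not show where this df2 comes from either; per the upstream guide it is simply a fresh copy of df, so the original frame is left untouched (a sketch, assuming that setup):

df2 = df.copy()   # work on a copy; the where-setting below then negates its positive values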

In [30]: df2                                                                                                                                                           
Out[30]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.102982  5  NaN
2013-01-02  0.314732  1.187635 -0.290305  5  1.0
2013-01-03  0.535155  0.453357  0.308055  5  2.0
2013-01-04  0.688215 -1.474558 -2.561580  5  3.0
2013-01-05 -1.579950 -0.317184  1.522803  5  4.0
2013-01-06  1.104539 -1.272097  1.423444  5  5.0

In [31]: df2[df2>0] = -df2                                                                                                                                             

In [32]: df2                                                                                                                                                           
Out[32]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -0.102982 -5  NaN
2013-01-02 -0.314732 -1.187635 -0.290305 -5 -1.0
2013-01-03 -0.535155 -0.453357 -0.308055 -5 -2.0
2013-01-04 -0.688215 -1.474558 -2.561580 -5 -3.0
2013-01-05 -1.579950 -0.317184 -1.522803 -5 -4.0
2013-01-06 -1.104539 -1.272097 -1.423444 -5 -5.0

Missing data

pandas primarily uses np.nan to represent missing data; by default it is excluded from computations.
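
A minimal sketch of that behaviour (the Series here is illustrative, not from the transcript): aggregations skip NaN unless told otherwise.

tmp = pd.Series([1.0, np.nan, 3.0])
tmp.mean()              # 2.0 -- the NaN is skipped by default
tmp.mean(skipna=False)  # nan -- force missing values to propagate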

Reindexing lets you change, add, or delete the index on a specified axis; it returns a copy of the data.

In [17]: df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ["E"])                                                                                               

In [18]: df1                                                                                                                                                               
Out[18]: 
                   A         B         C  D    F   E
2013-01-01  0.000000  0.000000  0.009247  5  NaN NaN
2013-01-02  1.012606  0.986936  1.007574  5  1.0 NaN
2013-01-03 -0.659537 -0.193782 -0.037052  5  2.0 NaN
2013-01-04  1.957053  1.212591  0.639196  5  3.0 NaN

In [19]: df1.loc[dates[0]:dates[1],'E'] = 1                                                                                                                                

In [20]: df1                                                                                                                                                               
Out[20]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.009247  5  NaN  1.0
2013-01-02  1.012606  0.986936  1.007574  5  1.0  1.0
2013-01-03 -0.659537 -0.193782 -0.037052  5  2.0  NaN
2013-01-04  1.957053  1.212591  0.639196  5  3.0  NaN

  Drop any rows that contain missing data:

In [21]: df1.dropna(how='any')                                                                                                                                             
Out[21]: 
                   A         B         C  D    F    E
2013-01-02  1.012606  0.986936  1.007574  5  1.0  1.0

  Fill missing data:

In [22]: df1.fillna(value=5)                                                                                                                                               
Out[22]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000  0.009247  5  5.0  1.0
2013-01-02  1.012606  0.986936  1.007574  5  1.0  1.0
2013-01-03 -0.659537 -0.193782 -0.037052  5  2.0  5.0
2013-01-04  1.957053  1.212591  0.639196  5  3.0  5.0

  Get a boolean mask showing where values are NaN:

In [23]: pd.isna(df1)                                                                                                                                                      
Out[23]: 
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Operations

 

Stats

Operations in general exclude missing data. Performing a descriptive statistic:

In [12]: df                                                                                                                                                      
Out[12]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.913359  5  NaN
2013-01-02 -1.970074  0.380875 -0.648481  5  1.0
2013-01-03  0.281023  0.861409  0.084303  5  2.0
2013-01-04 -0.795286  1.961307  1.569313  5  3.0
2013-01-05  0.974005  0.648449 -1.928545  5  4.0
2013-01-06  0.000812 -0.274185 -0.469916  5  5.0

In [13]: df.mean()                                                                                                                                               
Out[13]: 
A   -0.251587
B    0.596309
C   -0.079994
D    5.000000
F    3.000000
dtype: float64

In [14]: df.mean(1)                                                                                                                                              
Out[14]: 
2013-01-01    1.478340
2013-01-02    0.752464
2013-01-03    1.645347
2013-01-04    2.147067
2013-01-05    1.738782
2013-01-06    1.851342
Freq: D, dtype: float64

  The two calls compute the mean along different axes: per column by default, per row with df.mean(1).

Operating with objects that have different dimensionality and need alignment: pandas automatically broadcasts along the specified dimension and introduces NaN where the labels do not align.

In [27]: s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)                                                                                                  

In [28]: s                                                                                                                                                       
Out[28]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [29]: df.sub?                                                                                                                                                 

In [30]: df.sub(s,axis='index')                                                                                                                                  
Out[30]: 
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -0.718977 -0.138591 -0.915697  4.0  1.0
2013-01-04 -3.795286 -1.038693 -1.430687  2.0  0.0
2013-01-05 -4.025995 -4.351551 -6.928545  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN

  

Apply

Applying functions to the data:

In [31]: df                                                                                                                                                      
Out[31]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000  0.913359  5  NaN
2013-01-02 -1.970074  0.380875 -0.648481  5  1.0
2013-01-03  0.281023  0.861409  0.084303  5  2.0
2013-01-04 -0.795286  1.961307  1.569313  5  3.0
2013-01-05  0.974005  0.648449 -1.928545  5  4.0
2013-01-06  0.000812 -0.274185 -0.469916  5  5.0

In [32]: df.apply(np.cumsum)                                                                                                                                     
Out[32]: 
                   A         B         C   D     F
2013-01-01  0.000000  0.000000  0.913359   5   NaN
2013-01-02 -1.970074  0.380875  0.264878  10   1.0
2013-01-03 -1.689051  1.242284  0.349182  15   3.0
2013-01-04 -2.484337  3.203590  1.918495  20   6.0
2013-01-05 -1.510332  3.852039 -0.010051  25  10.0
2013-01-06 -1.509519  3.577854 -0.479966  30  15.0

In [33]: df.apply?                                                                                                                                               

In [34]: df.apply(lambda x: x.max() - x.min())                                                                                                                   
Out[34]: 
A    2.944080
B    2.235492
C    3.497859
D    0.000000
F    4.000000
dtype: float64


  By default apply operates column by column; you pass it a function (np.cumsum above, and a lambda computing each column's range here).
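
As a small extra sketch (not part of the original notes), the same lambda can be applied row-wise instead by passing axis=1:

df.apply(lambda x: x.max() - x.min(), axis=1)   # range of each row rather than each column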

Histogramming

See more at Histogramming and Discretization.

In [35]: s = pd.Series(np.random.randint(0,7,size=10))                                                                                                           

In [36]: s                                                                                                                                                       
Out[36]: 
0    1
1    0
2    1
3    3
4    4
5    0
6    0
7    5
8    0
9    1
dtype: int64

In [37]: s.value_counts()                                                                                                                                        
Out[37]: 
0    4
1    3
5    1
4    1
3    1
dtype: int64

In [38]: s.value_counts().__class__                                                                                                                              
Out[38]: pandas.core.series.Series

In [39]: s.value_counts().index                                                                                                                                  
Out[39]: Int64Index([0, 1, 5, 4, 3], dtype='int64')

  The index of the result holds the observed values, and the values hold how many times each occurred.

String Methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.

In [40]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])                                                                            

In [41]: s.str.lower()                                                                                                                                           
Out[41]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object


Merge

Concat

pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the Merging section.

Concatenating pandas objects together with concat():

In [6]: df = pd.DataFrame(np.random.randn(10, 4),index=[chr(i) for i in range(65,75)])                                                          

In [7]: df                                                                                                                                      
Out[7]: 
          0         1         2         3
A -0.352374  1.278078  0.543435  0.085902
B -0.572048 -0.053377  1.278151 -0.143507
C  0.058471 -0.678905 -0.928697 -0.275203
D -0.338097  2.616637 -1.541948  0.127263
E -0.460966 -1.571678 -1.958119 -0.533291
F -0.832041 -0.103917  1.635657  0.887229
G -0.171535 -2.181684 -1.682033  0.462375
H -1.001720  0.925746 -0.205054 -1.493562
I -0.213802 -1.214741 -0.864377  0.735728
J -0.375842  0.406692  0.207853  0.626410

In [8]: prices = [df[:3],df[3:7],df[7:]]                                                                                                        

In [9]: prices                                                                                                                                  
Out[9]: 
[          0         1         2         3
 A -0.352374  1.278078  0.543435  0.085902
 B -0.572048 -0.053377  1.278151 -0.143507
 C  0.058471 -0.678905 -0.928697 -0.275203,
           0         1         2         3
 D -0.338097  2.616637 -1.541948  0.127263
 E -0.460966 -1.571678 -1.958119 -0.533291
 F -0.832041 -0.103917  1.635657  0.887229
 G -0.171535 -2.181684 -1.682033  0.462375,
           0         1         2         3
 H -1.001720  0.925746 -0.205054 -1.493562
 I -0.213802 -1.214741 -0.864377  0.735728
 J -0.375842  0.406692  0.207853  0.626410]

In [10]: pd.concat(prices)                                                                                                                      
Out[10]: 
          0         1         2         3
A -0.352374  1.278078  0.543435  0.085902
B -0.572048 -0.053377  1.278151 -0.143507
C  0.058471 -0.678905 -0.928697 -0.275203
D -0.338097  2.616637 -1.541948  0.127263
E -0.460966 -1.571678 -1.958119 -0.533291
F -0.832041 -0.103917  1.635657  0.887229
G -0.171535 -2.181684 -1.682033  0.462375
H -1.001720  0.925746 -0.205054 -1.493562
I -0.213802 -1.214741 -0.864377  0.735728
J -0.375842  0.406692  0.207853  0.626410

  

Join

SQL style merges. See the Database style joining section.
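
The left and right frames queried below are not constructed in the transcript; in the upstream guide they are built with a duplicated key, roughly like this:

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})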

In [21]: left                                                                                                                                   
Out[21]: 
   key  lval
0  foo     1
1  foo     2

In [22]: right                                                                                                                                  
Out[22]: 
   key  rval
0  foo     4
1  foo     5

In [23]: pd.merge(left, right, on="key")                                                                                                        
Out[23]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

  This behaves like an inner join in SQL: with the key repeated on both sides, you get the Cartesian product of the matching rows.

Another example, this time with unique keys:

In [24]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})                                                                           

In [25]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]}) 
    ...:                                                                                                                                        

In [26]: pd.merge(left, right, on="key")                                                                                                        
Out[26]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5

In [27]: left                                                                                                                                   
Out[27]: 
   key  lval
0  foo     1
1  bar     2

In [28]: right                                                                                                                                  
Out[28]: 
   key  rval
0  foo     4
1  bar     5

  From my own testing: both frames need the shared key column, and the merge keeps the intersection of the key values (an inner join by default), as sketched below.
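
A quick sketch to back that up (the keys here are my own example, not from the notes): with the default inner merge, keys present on only one side are dropped.

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})
pd.merge(left, right, on="key")   # only the shared "foo" key survives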

 

Grouping

By “group by” we are referring to a process involving one or more of the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

See the Grouping section.

 

In [33]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    ...:         "C": np.random.randn(8),
    ...:         "D": np.random.randn(8),
    ...:     }
    ...: )

In [34]: df                                                                                                                                     
Out[34]: 
     A      B         C         D
0  foo    one  0.964528 -0.464514
1  bar    one  2.223520  1.743989
2  foo    two -0.311332  0.242599
3  bar  three -1.119041 -1.223514
4  foo    two -0.879870 -1.140529
5  bar    two -1.158119 -0.577413
6  foo    one  0.349084 -0.770265
7  foo  three -0.063142 -2.306210

In [35]: df.groupby('A')                                                                                                                        
Out[35]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87e4b64d60>

In [36]: df.groupby('A').count()                                                                                                                
Out[36]: 
     B  C  D
A           
bar  3  3  3
foo  5  5  5

In [37]: df.groupby('A').sum()                                                                                                                  
Out[37]: 
            C         D
A                      
bar -0.053639 -0.056937
foo  0.059268 -4.438920

  Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.

In [38]: df.groupby(["A", "B"]).sum()                                                                                                           
Out[38]: 
                  C         D
A   B                        
bar one    2.223520  1.743989
    three -1.119041 -1.223514
    two   -1.158119 -0.577413
foo one    1.313612 -1.234779
    three -0.063142 -2.306210
    two   -1.191203 -0.897931

  

Reshaping

See the sections on Hierarchical Indexing and Reshaping.
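
The tuples list printed below is never built in the transcript; in the upstream guide it comes from zipping two lists of labels:

tuples = list(
    zip(
        ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
        ["one", "two", "one", "two", "one", "two", "one", "two"],
    )
)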

In [49]: tuples
Out[49]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [50]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])                                                                   

In [52]: df = pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B'])                                                                  

In [53]: df                                                                                                                                     
Out[53]: 
                     A         B
first second                    
bar   one     1.468431  1.239477
      two     1.134492  0.163859
baz   one    -0.646760 -1.188458
      two     1.389973 -1.437824
foo   one     0.438599  0.053046
      two     0.588275  0.074167
qux   one    -0.197791 -1.983309
      two     0.725947  0.305324

In [54]: df = df[:4]                                                                                                                            

In [55]: df                                                                                                                                     
Out[55]: 
                     A         B
first second                    
bar   one     1.468431  1.239477
      two     1.134492  0.163859
baz   one    -0.646760 -1.188458
      two     1.389973 -1.437824

  We built a DataFrame with a MultiIndex of those tuples and kept only the first four rows.

The stack() method “compresses” a level in the DataFrame’s columns.

My understanding: stack() moves the column labels into the row index, so the row index gains a level (becoming a MultiIndex) and the multiple columns collapse into a single column of values; unstack() does the opposite, moving a level of the row index back out into the columns.

In [76]: stacked = df.stack()                                                                                                                   

In [77]: stacked                                                                                                                                
Out[77]: 
first  second   
bar    one     A    1.468431
               B    1.239477
       two     A    1.134492
               B    0.163859
baz    one     A   -0.646760
               B   -1.188458
       two     A    1.389973
               B   -1.437824
dtype: float64

In [78]: stacked.index                                                                                                                          
Out[78]: 
MultiIndex([('bar', 'one', 'A'),
            ('bar', 'one', 'B'),
            ('bar', 'two', 'A'),
            ('bar', 'two', 'B'),
            ('baz', 'one', 'A'),
            ('baz', 'one', 'B'),
            ('baz', 'two', 'A'),
            ('baz', 'two', 'B')],
           names=['first', 'second', None])

In [79]: stacked.__class__                                                                                                                      
Out[79]: pandas.core.series.Series

  As the output shows, stack() turned the DataFrame into a Series, and the original two-level index became a three-level one.

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

 

In [80]: stacked                                                                                                                                
Out[80]: 
first  second   
bar    one     A    1.468431
               B    1.239477
       two     A    1.134492
               B    0.163859
baz    one     A   -0.646760
               B   -1.188458
       two     A    1.389973
               B   -1.437824
dtype: float64

In [81]: stacked.unstack()                                                                                                                      
Out[81]: 
                     A         B
first second                    
bar   one     1.468431  1.239477
      two     1.134492  0.163859
baz   one    -0.646760 -1.188458
      two     1.389973 -1.437824

In [82]: stacked.unstack().__class__                                                                                                            
Out[82]: pandas.core.frame.DataFrame

In [83]: stacked.unstack(0)                                                                                                                     
Out[83]: 
first          bar       baz
second                      
one    A  1.468431 -0.646760
       B  1.239477 -1.188458
two    A  1.134492  1.389973
       B  0.163859 -1.437824

  With stack() understood, unstack() is easy: by default it moves the last level of the MultiIndex back into the columns, but you can also pass the level to unstack (0 here).

 

Pivot tables

See the section on Pivot Tables.

 A pivot table, as I understand it, rearranges the existing data: you pick suitable columns to serve as the index, another column to become the new columns, and another column to supply the values.
In [84]: df = pd.DataFrame(
    ...:     {
    ...:         "A": ["one", "one", "two", "three"] * 3,
    ...:         "B": ["A", "B", "C"] * 4,
    ...:         "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
    ...:         "D": np.random.randn(12),
    ...:         "E": np.random.randn(12),
    ...:     }
    ...: )

In [85]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])                                                                        
Out[85]: 
C             bar       foo
A     B                    
one   A -1.549085  0.117403
      B -0.343714 -0.231218
      C -0.890250 -1.735391
three A -0.590521       NaN
      B       NaN -2.155207
      C -0.771405       NaN
two   A       NaN  1.263717
      B  0.061399       NaN
      C       NaN  0.912017

In [86]: pd.pivot_table(df, values="E", index=["A", "B"], columns=["C"])                                                                        
Out[86]: 
C             bar       foo
A     B                    
one   A -0.908049  0.088130
      B -0.661966  1.091079
      C -1.096862  0.875454
three A -0.425327       NaN
      B       NaN  0.474478
      C -0.928048       NaN
two   A       NaN -0.663361
      B -1.199722       NaN
      C       NaN -0.270919

In [87]: pd.pivot_table(df, values=["E",'D'], index=["A", "B"], columns=["C"])                                                                  
Out[87]: 
                D                   E          
C             bar       foo       bar       foo
A     B                                        
one   A -1.549085  0.117403 -0.908049  0.088130
      B -0.343714 -0.231218 -0.661966  1.091079
      C -0.890250 -1.735391 -1.096862  0.875454
three A -0.590521       NaN -0.425327       NaN
      B       NaN -2.155207       NaN  0.474478
      C -0.771405       NaN -0.928048       NaN
two   A       NaN  1.263717       NaN -0.663361
      B  0.061399       NaN -1.199722       NaN
      C       NaN  0.912017       NaN -0.270919

  

Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
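
The notes end here without an example; a minimal resampling sketch along the lines of the upstream guide (the frequencies and values are illustrative) would be:

import numpy as np
import pandas as pd

rng = pd.date_range("2012-01-01", periods=100, freq="S")            # 100 points, one per second
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("5Min").sum()   # downsample the secondly data into 5-minute bins and sum each bin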

 

 

 

  

 
