30分钟了解pandas
参考资料:https://pandas.pydata.org/docs/user_guide/10min.html
创建对象
创建一个Series对象
In [150]: s = pd.Series([1, 3, 5, np.nan, 6, 8]) In [151]: s Out[151]: 0 1.0 1 3.0 2 5.0 3 NaN 4 6.0 5 8.0 dtype: float64
创建一个DataFrame对象通过NumPy的数组
In [152]: dates = pd.date_range('20130101', periods=6) In [153]: dates Out[153]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') In [154]: df = pd.DataFrame(np.random.randn(6,4),index=dates, columns=list('ABCD')) In [155]: df Out[155]: A B C D 2013-01-01 0.911951 0.119077 2.244598 -1.524168 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 2013-01-04 1.222247 -0.651681 1.764630 0.500507 2013-01-05 -0.332192 -1.424871 -0.680254 0.370109 2013-01-06 -0.724522 -1.238991 0.223425 -1.775954
通过一个字典来生成一个df对象
In [179]: df2 = pd.DataFrame( ...: ...: { ...: ...: "A": 1.0, ...: ...: "B": pd.Timestamp("20130102"), ...: ...: "C": pd.Series(1, index=list(range(4)), dtype="float32"), ...: ...: "D": np.array([3] * 4, dtype="int32"), ...: ...: "E": pd.Categorical(["test", "train", "test", "train"]), ...: ...: "F": "foo", ...: ...: }, ...: ...: ...: ) In [180]: df2 Out[180]: A B C D E F 0 1.0 2013-01-02 1.0 3 test foo 1 1.0 2013-01-02 1.0 3 train foo 2 1.0 2013-01-02 1.0 3 test foo 3 1.0 2013-01-02 1.0 3 train foo In [181]: df2.dtypes Out[181]: A float64 B datetime64[ns] C float32 D int32 E category F object dtype: object In [182]:
通过dtypes可以查看相关索引的内容类型。
查看数据
通过head, tail,可以查看部分内容,通过index与columns可以查看行索引与列索引的信息
In [183]: df.head() Out[183]: A B C D 2013-01-01 0.911951 0.119077 2.244598 -1.524168 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 2013-01-04 1.222247 -0.651681 1.764630 0.500507 2013-01-05 -0.332192 -1.424871 -0.680254 0.370109 In [184]: df.tail(2) Out[184]: A B C D 2013-01-05 -0.332192 -1.424871 -0.680254 0.370109 2013-01-06 -0.724522 -1.238991 0.223425 -1.775954 In [185]: df.head(1) Out[185]: A B C D 2013-01-01 0.911951 0.119077 2.244598 -1.524168 In [186]: df.index Out[186]: DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04', '2013-01-05', '2013-01-06'], dtype='datetime64[ns]', freq='D') In [187]: df.columns Out[187]: Index(['A', 'B', 'C', 'D'], dtype='object') In [188]:
通过DataFrame.to_numpy()可以给你一个NumPy的array数据,但有时候这会很耗费电脑,因为pandas是根据每一列一个dtypes的,而NumPy的array是整个数据一个dtype的
In [194]: df.to_numpy() Out[194]: array([[ 0.9119509 , 0.11907694, 2.24459767, -1.52416844], [-0.71159066, 1.81432742, 0.85962346, -0.24911614], [-0.0414173 , -1.15847237, 1.03775241, 1.12435552], [ 1.22224697, -0.65168145, 1.76462966, 0.50050719], [-0.33219183, -1.42487132, -0.68025439, 0.37010889], [-0.72452176, -1.23899146, 0.22342519, -1.77595409]]) In [195]: df2.to_numpy() Out[195]: array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'], [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
上面演示了两种to_numpy的方式,其中第二种的花费要多很多。
describe()
将会展示一些数据的基础信息,比如统计数量,标准差,最小值,最大值.......
In [202]: df.describe() Out[202]: A B C D count 6.000000 6.000000 6.000000 6.000000 mean 0.054079 -0.423435 0.908296 -0.259045 std 0.830825 1.229820 1.051730 1.165178 min -0.724522 -1.424871 -0.680254 -1.775954 25% -0.616741 -1.218862 0.382475 -1.205405 50% -0.186805 -0.905077 0.948688 0.060496 75% 0.673609 -0.073613 1.582910 0.467908 max 1.222247 1.814327 2.244598 1.124356
通过T可以把行索引与列索引进行转置。
In [204]: df.T Out[204]: 2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06 A 0.911951 -0.711591 -0.041417 1.222247 -0.332192 -0.724522 B 0.119077 1.814327 -1.158472 -0.651681 -1.424871 -1.238991 C 2.244598 0.859623 1.037752 1.764630 -0.680254 0.223425 D -1.524168 -0.249116 1.124356 0.500507 0.370109 -1.775954
通过索引进行排序
通过sort_index按照索引的数值大小,可以进行排序
In [207]: df.sort_index(ascending=False) Out[207]: A B C D 2013-01-06 -0.724522 -1.238991 0.223425 -1.775954 2013-01-05 -0.332192 -1.424871 -0.680254 0.370109 2013-01-04 1.222247 -0.651681 1.764630 0.500507 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-01 0.911951 0.119077 2.244598 -1.524168 In [208]: df.sort_index? In [209]: df.sort_index(axis=1, ascending=False) Out[209]: D C B A 2013-01-01 -1.524168 2.244598 0.119077 0.911951 2013-01-02 -0.249116 0.859623 1.814327 -0.711591 2013-01-03 1.124356 1.037752 -1.158472 -0.041417 2013-01-04 0.500507 1.764630 -0.651681 1.222247 2013-01-05 0.370109 -0.680254 -1.424871 -0.332192 2013-01-06 -1.775954 0.223425 -1.238991 -0.724522
当然最后还可以通过具体某一列的数值大小进行排序
In [212]: df.sort_values(by='B') Out[212]: A B C D 2013-01-05 -0.332192 -1.424871 -0.680254 0.370109 2013-01-06 -0.724522 -1.238991 0.223425 -1.775954 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 2013-01-04 1.222247 -0.651681 1.764630 0.500507 2013-01-01 0.911951 0.119077 2.244598 -1.524168 2013-01-02 -0.711591 1.814327 0.859623 -0.249116
选择
pandas建议用过.at
, .iat
, .loc
and .iloc
.这4类方法进行取值。
选择单列数据,返回的是一个Series对象,单个数值的时候返回列数据
In [226]: df['A'] Out[226]: 2013-01-01 0.911951 2013-01-02 -0.711591 2013-01-03 -0.041417 2013-01-04 1.222247 2013-01-05 -0.332192 2013-01-06 -0.724522 Freq: D, Name: A, dtype: float64
通过切片取值的时候返回多行数据的df对象。[理解为被切片的df对象],切片为取头取尾。
In [229]: df[0:3] Out[229]: A B C D 2013-01-01 0.911951 0.119077 2.244598 -1.524168 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 In [230]: df["20130102":'20130103'] Out[230]: A B C D 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-03 -0.041417 -1.158472 1.037752 1.124356
通过便签取值,国内书中有说是显式取值。通过loc的方法,注意调用该方法时不用()直接后面加[]就可以.
In [238]: df.loc[dates[0]] Out[238]: A 0.911951 B 0.119077 C 2.244598 D -1.524168 Name: 2013-01-01 00:00:00, dtype: float64
取多列坐标数据
In [239]: df.loc[:,['A','B']] Out[239]: A B 2013-01-01 0.911951 0.119077 2013-01-02 -0.711591 1.814327 2013-01-03 -0.041417 -1.158472 2013-01-04 1.222247 -0.651681 2013-01-05 -0.332192 -1.424871 2013-01-06 -0.724522 -1.238991
通过切片选择不同的行与列的信息
In [240]: df.loc["20130102":"20130104",["A","B"]] Out[240]: A B 2013-01-02 -0.711591 1.814327 2013-01-03 -0.041417 -1.158472 2013-01-04 1.222247 -0.651681
只选取一行,选取指定列
In [241]: df.loc["20130102",["A","B"]] Out[241]: A -0.711591 B 1.814327 Name: 2013-01-02 00:00:00, dtype: float64
最后介绍两种方式取出单个数值,也就是所谓的标量
In [242]: df.loc[dates[0],"A"] Out[242]: 0.91195089904327 In [243]: df.at[dates[0],"A"] Out[243]: 0.91195089904327 In [244]:
通过位置来取值,也就是所谓的隐式传参 iloc
跟loc使用差不多,可以传入单值,多值,还有切片
单值,返回某一行的数据的Series对象
In [244]: df.iloc[0] Out[244]: A 0.911951 B 0.119077 C 2.244598 D -1.524168 Name: 2013-01-01 00:00:00, dtype: float64 In [245]: df.iloc[3] Out[245]: A 1.222247 B -0.651681 C 1.764630 D 0.500507 Name: 2013-01-04 00:00:00, dtype: float64
也可以通过切片,这是跟Python一样,取头不取尾的
In [246]: df.iloc[3:5,0:2] Out[246]: A B 2013-01-04 1.222247 -0.651681 2013-01-05 -0.332192 -1.424871
还可以传入多个位置参数来取,位置参数之间用逗号分割
In [247]: df.iloc[[1,2,4],[0,2]] Out[247]: A C 2013-01-02 -0.711591 0.859623 2013-01-03 -0.041417 1.037752 2013-01-05 -0.332192 -0.680254
可以通过单个:来全选数据
In [248]: df.iloc[1:3] Out[248]: A B C D 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 In [249]: df.iloc[1:3,:] Out[249]: A B C D 2013-01-02 -0.711591 1.814327 0.859623 -0.249116 2013-01-03 -0.041417 -1.158472 1.037752 1.124356 In [250]: df.iloc[:,1:3] Out[250]: B C 2013-01-01 0.119077 2.244598 2013-01-02 1.814327 0.859623 2013-01-03 -1.158472 1.037752 2013-01-04 -0.651681 1.764630 2013-01-05 -1.424871 -0.680254 2013-01-06 -1.238991 0.223425
也可以通过输入两个数值坐标来取标量。
In [251]: df.iloc[1,1] Out[251]: 1.8143274155708045 In [252]: df.iat[1,1] Out[252]: 1.8143274155708045
bolllean 索引,有些书称为掩码取值
使用单一的列数据来选择数据
In [8]: df[df['A'] > 0] Out[8]: A B C D 2013-01-02 0.314732 1.187635 -0.290305 -1.091247 2013-01-03 0.535155 0.453357 0.308055 -0.533740 2013-01-04 0.688215 -1.474558 -2.561580 -1.211188 2013-01-06 1.104539 -1.272097 1.423444 0.417609
针对df对象所有对象进行选值
In [9]: df[df>0] Out[9]: A B C D 2013-01-01 NaN 0.954575 NaN 0.070635 2013-01-02 0.314732 1.187635 NaN NaN 2013-01-03 0.535155 0.453357 0.308055 NaN 2013-01-04 0.688215 NaN NaN NaN 2013-01-05 NaN NaN 1.522803 NaN 2013-01-06 1.104539 NaN 1.423444 0.417609
使用isin()来选择数据
In [14]: df2 Out[14]: A B C D E 2013-01-01 -1.629264 0.954575 -0.102982 0.070635 one 2013-01-02 0.314732 1.187635 -0.290305 -1.091247 one 2013-01-03 0.535155 0.453357 0.308055 -0.533740 two 2013-01-04 0.688215 -1.474558 -2.561580 -1.211188 three 2013-01-05 -1.579950 -0.317184 1.522803 -0.702557 four 2013-01-06 1.104539 -1.272097 1.423444 0.417609 three In [15]: df2[df2['E'].isin(['two','four'])] Out[15]: A B C D E 2013-01-03 0.535155 0.453357 0.308055 -0.533740 two 2013-01-05 -1.579950 -0.317184 1.522803 -0.702557 four
设定
设置一个新的列将自动对齐数据的索引。
In [16]: s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20130102',periods=6)) In [17]: s1 Out[17]: 2013-01-02 1 2013-01-03 2 2013-01-04 3 2013-01-05 4 2013-01-06 5 2013-01-07 6 Freq: D, dtype: int64 In [18]: df['F'] = s1
分别通过便签与索引设定值
In [19]: df.at[dates[0],"A"] = 0 In [20]: df.iat[0,1] = 0
通过NumPy的数组来分配数据
In [26]: df.loc[:,"D"] = np.array([5] * len(df)) In [27]: df Out[27]: A B C D F 2013-01-01 0.000000 0.000000 -0.102982 5 NaN 2013-01-02 0.314732 1.187635 -0.290305 5 1.0 2013-01-03 0.535155 0.453357 0.308055 5 2.0 2013-01-04 0.688215 -1.474558 -2.561580 5 3.0 2013-01-05 -1.579950 -0.317184 1.522803 5 4.0 2013-01-06 1.104539 -1.272097 1.423444 5 5.0
操作具体选定的数据
In [30]: df2 Out[30]: A B C D F 2013-01-01 0.000000 0.000000 -0.102982 5 NaN 2013-01-02 0.314732 1.187635 -0.290305 5 1.0 2013-01-03 0.535155 0.453357 0.308055 5 2.0 2013-01-04 0.688215 -1.474558 -2.561580 5 3.0 2013-01-05 -1.579950 -0.317184 1.522803 5 4.0 2013-01-06 1.104539 -1.272097 1.423444 5 5.0 In [31]: df2[df2>0] = -df2 In [32]: df2 Out[32]: A B C D F 2013-01-01 0.000000 0.000000 -0.102982 -5 NaN 2013-01-02 -0.314732 -1.187635 -0.290305 -5 -1.0 2013-01-03 -0.535155 -0.453357 -0.308055 -5 -2.0 2013-01-04 -0.688215 -1.474558 -2.561580 -5 -3.0 2013-01-05 -1.579950 -0.317184 -1.522803 -5 -4.0 2013-01-06 -1.104539 -1.272097 -1.423444 -5 -5.0
缺失数据
pandas主要使用np.nan来表示缺失值,一般情况下它不参与计算。
使用Reindexing可以改变指定象限的索引值。
In [17]: df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ["E"]) In [18]: df1 Out[18]: A B C D F E 2013-01-01 0.000000 0.000000 0.009247 5 NaN NaN 2013-01-02 1.012606 0.986936 1.007574 5 1.0 NaN 2013-01-03 -0.659537 -0.193782 -0.037052 5 2.0 NaN 2013-01-04 1.957053 1.212591 0.639196 5 3.0 NaN In [19]: df1.loc[dates[0]:dates[1],'E'] = 1 In [20]: df1 Out[20]: A B C D F E 2013-01-01 0.000000 0.000000 0.009247 5 NaN 1.0 2013-01-02 1.012606 0.986936 1.007574 5 1.0 1.0 2013-01-03 -0.659537 -0.193782 -0.037052 5 2.0 NaN 2013-01-04 1.957053 1.212591 0.639196 5 3.0 NaN
放弃任何有nan的行
In [21]: df1.dropna(how='any') Out[21]: A B C D F E 2013-01-02 1.012606 0.986936 1.007574 5 1.0 1.0
填写缺失的数据
In [22]: df1.fillna(value=5) Out[22]: A B C D F E 2013-01-01 0.000000 0.000000 0.009247 5 5.0 1.0 2013-01-02 1.012606 0.986936 1.007574 5 1.0 1.0 2013-01-03 -0.659537 -0.193782 -0.037052 5 2.0 5.0 2013-01-04 1.957053 1.212591 0.639196 5 3.0 5.0
通过bool掩码值获取是否是nan的df对象
In [23]: pd.isna(df1) Out[23]: A B C D F E 2013-01-01 False False False False True False 2013-01-02 False False False False False False 2013-01-03 False False False False False True 2013-01-04 False False False False False True
Operations[操作]
Stats[操作]
操作一般不包括缺失的数据,进行描述性统计
In [12]: df Out[12]: A B C D F 2013-01-01 0.000000 0.000000 0.913359 5 NaN 2013-01-02 -1.970074 0.380875 -0.648481 5 1.0 2013-01-03 0.281023 0.861409 0.084303 5 2.0 2013-01-04 -0.795286 1.961307 1.569313 5 3.0 2013-01-05 0.974005 0.648449 -1.928545 5 4.0 2013-01-06 0.000812 -0.274185 -0.469916 5 5.0 In [13]: df.mean() Out[13]: A -0.251587 B 0.596309 C -0.079994 D 5.000000 F 3.000000 dtype: float64 In [14]: df.mean(1) Out[14]: 2013-01-01 1.478340 2013-01-02 0.752464 2013-01-03 1.645347 2013-01-04 2.147067 2013-01-05 1.738782 2013-01-06 1.851342 Freq: D, dtype: float64
针对不同的象限取平均。
操作不同维度和需要对齐的对象。pandas将会自动沿指定的维度广播。
In [27]: s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2) In [28]: s Out[28]: 2013-01-01 NaN 2013-01-02 NaN 2013-01-03 1.0 2013-01-04 3.0 2013-01-05 5.0 2013-01-06 NaN Freq: D, dtype: float64 In [29]: df.sub? In [30]: df.sub(s,axis='index') Out[30]: A B C D F 2013-01-01 NaN NaN NaN NaN NaN 2013-01-02 NaN NaN NaN NaN NaN 2013-01-03 -0.718977 -0.138591 -0.915697 4.0 1.0 2013-01-04 -3.795286 -1.038693 -1.430687 2.0 0.0 2013-01-05 -4.025995 -4.351551 -6.928545 0.0 -1.0 2013-01-06 NaN NaN NaN NaN NaN
Apply
申请
Applying functions to the data:
将函数应用与数据
In [31]: df Out[31]: A B C D F 2013-01-01 0.000000 0.000000 0.913359 5 NaN 2013-01-02 -1.970074 0.380875 -0.648481 5 1.0 2013-01-03 0.281023 0.861409 0.084303 5 2.0 2013-01-04 -0.795286 1.961307 1.569313 5 3.0 2013-01-05 0.974005 0.648449 -1.928545 5 4.0 2013-01-06 0.000812 -0.274185 -0.469916 5 5.0 In [32]: df.apply(np.cumsum) Out[32]: A B C D F 2013-01-01 0.000000 0.000000 0.913359 5 NaN 2013-01-02 -1.970074 0.380875 0.264878 10 1.0 2013-01-03 -1.689051 1.242284 0.349182 15 3.0 2013-01-04 -2.484337 3.203590 1.918495 20 6.0 2013-01-05 -1.510332 3.852039 -0.010051 25 10.0 2013-01-06 -1.509519 3.577854 -0.479966 30 15.0 In [33]: df.apply? In [34]: df.apply(lambda x: x.max() - x.min()) Out[34]: A 2.944080 B 2.235492 C 3.497859 D 0.000000 F 4.000000 dtype: float64 In [35]:
默认对列数据进行操作,添加一个函数。
Histogramming¶
柱状图
See more at Histogramming and Discretization.
In [35]: s = pd.Series(np.random.randint(0,7,size=10)) In [36]: s Out[36]: 0 1 1 0 2 1 3 3 4 4 5 0 6 0 7 5 8 0 9 1 dtype: int64 In [37]: s.value_counts() Out[37]: 0 4 1 3 5 1 4 1 3 1 dtype: int64 In [38]: s.value_counts().__class__ Out[38]: pandas.core.series.Series In [39]: s.value_counts().index Out[39]: Int64Index([0, 1, 5, 4, 3], dtype='int64')
前面的索引时出现的数字,后面是出现的次数。
String Methods
字符串方法
Series is equipped with a set of string processing methods in the str
attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str
generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
Series在str属性中配备了一组字符串处理方法,这些方法可以轻松地对数组的每个元素进行操作,
In [40]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"]) In [41]: s.str.lower() Out[41]: 0 a 1 b 2 c 3 aaba 4 baca 5 NaN 6 caba 7 dog 8 cat dtype: object In [42]:
Merge¶
合并
Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat()
:
用concat连接pandas对象
In [6]: df = pd.DataFrame(np.random.randn(10, 4),index=[chr(i) for i in range(65,75)]) In [7]: df Out[7]: 0 1 2 3 A -0.352374 1.278078 0.543435 0.085902 B -0.572048 -0.053377 1.278151 -0.143507 C 0.058471 -0.678905 -0.928697 -0.275203 D -0.338097 2.616637 -1.541948 0.127263 E -0.460966 -1.571678 -1.958119 -0.533291 F -0.832041 -0.103917 1.635657 0.887229 G -0.171535 -2.181684 -1.682033 0.462375 H -1.001720 0.925746 -0.205054 -1.493562 I -0.213802 -1.214741 -0.864377 0.735728 J -0.375842 0.406692 0.207853 0.626410 In [8]: prices = [df[:3],df[3:7],df[7:]] In [9]: prices Out[9]: [ 0 1 2 3 A -0.352374 1.278078 0.543435 0.085902 B -0.572048 -0.053377 1.278151 -0.143507 C 0.058471 -0.678905 -0.928697 -0.275203, 0 1 2 3 D -0.338097 2.616637 -1.541948 0.127263 E -0.460966 -1.571678 -1.958119 -0.533291 F -0.832041 -0.103917 1.635657 0.887229 G -0.171535 -2.181684 -1.682033 0.462375, 0 1 2 3 H -1.001720 0.925746 -0.205054 -1.493562 I -0.213802 -1.214741 -0.864377 0.735728 J -0.375842 0.406692 0.207853 0.626410] In [10]: pd.concat(prices) Out[10]: 0 1 2 3 A -0.352374 1.278078 0.543435 0.085902 B -0.572048 -0.053377 1.278151 -0.143507 C 0.058471 -0.678905 -0.928697 -0.275203 D -0.338097 2.616637 -1.541948 0.127263 E -0.460966 -1.571678 -1.958119 -0.533291 F -0.832041 -0.103917 1.635657 0.887229 G -0.171535 -2.181684 -1.682033 0.462375 H -1.001720 0.925746 -0.205054 -1.493562 I -0.213802 -1.214741 -0.864377 0.735728 J -0.375842 0.406692 0.207853 0.626410
Join
SQL style merges. See the Database style joining section.
类似与SQL样式的合并
In [21]: left Out[21]: key lval 0 foo 1 1 foo 2 In [22]: right Out[22]: key rval 0 foo 4 1 foo 5 In [23]: pd.merge(left, right, on="key") Out[23]: key lval rval 0 foo 1 4 1 foo 1 5 2 foo 2 4 3 foo 2 5
类似与sql里面的join的内联
另一种形式的样式
In [24]: left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]}) In [25]: right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]}) ...: In [26]: pd.merge(left, right, on="key") Out[26]: key lval rval 0 foo 1 4 1 bar 2 5 In [27]: left Out[27]: key lval 0 foo 1 1 bar 2 In [28]: right Out[28]: key rval 0 foo 4 1 bar 5
我的理解与测试,需要两个相同的列名,取两个列名中的交集。
Grouping¶
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
- 根据某些条件将数据分成不同的组
Applying a function to each group independently
- 将功能独立应用到每一个组
Combining the results into a data structure
- 将结果合并成一个数据结构
See the Grouping section.
In [33]: df = pd.DataFrame( ...: ....: { ...: ....: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"], ...: ....: "B": ["one", "one", "two", "three", "two", "two", "one", "three"], ...: ....: "C": np.random.randn(8), ...: ....: "D": np.random.randn(8), ...: ....: } ...: ....: ) In [34]: df Out[34]: A B C D 0 foo one 0.964528 -0.464514 1 bar one 2.223520 1.743989 2 foo two -0.311332 0.242599 3 bar three -1.119041 -1.223514 4 foo two -0.879870 -1.140529 5 bar two -1.158119 -0.577413 6 foo one 0.349084 -0.770265 7 foo three -0.063142 -2.306210 In [35]: df.groupby('A') Out[35]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87e4b64d60> In [36]: df.groupby('A').count() Out[36]: B C D A bar 3 3 3 foo 5 5 5 In [37]: df.groupby('A').sum() Out[37]: C D A bar -0.053639 -0.056937 foo 0.059268 -4.438920
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum()
function.
In [38]: df.groupby(["A", "B"]).sum() Out[38]: C D A B bar one 2.223520 1.743989 three -1.119041 -1.223514 two -1.158119 -0.577413 foo one 1.313612 -1.234779 three -0.063142 -2.306210 two -1.191203 -0.897931
Reshaping
See the sections on Hierarchical Indexing and Reshaping.
tuples Out[49]: [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')] In [50]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"]) In [52]: df = pd.DataFrame(np.random.randn(8,2),index=index,columns=['A','B']) In [53]: df Out[53]: A B first second bar one 1.468431 1.239477 two 1.134492 0.163859 baz one -0.646760 -1.188458 two 1.389973 -1.437824 foo one 0.438599 0.053046 two 0.588275 0.074167 qux one -0.197791 -1.983309 two 0.725947 0.305324 In [54]: df = df[:4] In [55]: df Out[55]: A B first second bar one 1.468431 1.239477 two 1.134492 0.163859 baz one -0.646760 -1.188458 two 1.389973 -1.437824
建了一份数据,用了联合索引,取了前面4条
The stack()
method “compresses” a level in the DataFrame’s columns.
我的理解是stack是将列的索引添加到行的索引进行叠加,使行索引变成多层索引[多一层索引],多列数据变成单列数据[少一层索引],unstack就使将行索引转换到列索引了。
In [76]: stacked = df.stack() In [77]: stacked Out[77]: first second bar one A 1.468431 B 1.239477 two A 1.134492 B 0.163859 baz one A -0.646760 B -1.188458 two A 1.389973 B -1.437824 dtype: float64 In [78]: stacked.index Out[78]: MultiIndex([('bar', 'one', 'A'), ('bar', 'one', 'B'), ('bar', 'two', 'A'), ('bar', 'two', 'B'), ('baz', 'one', 'A'), ('baz', 'one', 'B'), ('baz', 'two', 'A'), ('baz', 'two', 'B')], names=['first', 'second', None]) In [79]: stacked.__class__ Out[79]: pandas.core.series.Series
从输出可以看到,stack的方法,将df从df对象,转换成Series对象,本来两层的index,变成了三层
With a “stacked” DataFrame or Series (having a MultiIndex
as the index
), the inverse operation of stack()
is unstack()
, which by default unstacks the last level:
In [80]: stacked Out[80]: first second bar one A 1.468431 B 1.239477 two A 1.134492 B 0.163859 baz one A -0.646760 B -1.188458 two A 1.389973 B -1.437824 dtype: float64 In [81]: stacked.unstack() Out[81]: A B first second bar one 1.468431 1.239477 two 1.134492 0.163859 baz one -0.646760 -1.188458 two 1.389973 -1.437824 In [82]: stacked.unstack().__class__ Out[82]: pandas.core.frame.DataFrame In [83]: stacked.unstack(0) Out[83]: first bar baz second one A 1.468431 -0.646760 B 1.239477 -1.188458 two A 1.134492 1.389973 B 0.163859 -1.437824
有了stack的基础,unstack就更加好理解了,默认转移的是多层index的最后一层,但也可以指定多层index的索引。
Pivot tables¶
See the section on Pivot Tables.
数据透视表,根据我自身的理解,感觉就是在原有的数据上面,挑选合适的列选为index,合适的列的column,合适的列为值。In [84]: df = pd.DataFrame( ...: .....: { ...: .....: "A": ["one", "one", "two", "three"] * 3, ...: .....: "B": ["A", "B", "C"] * 4, ...: .....: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2, ...: .....: "D": np.random.randn(12), ...: .....: "E": np.random.randn(12), ...: .....: } ...: .....: ) ...: .....: In [85]: pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"]) Out[85]: C bar foo A B one A -1.549085 0.117403 B -0.343714 -0.231218 C -0.890250 -1.735391 three A -0.590521 NaN B NaN -2.155207 C -0.771405 NaN two A NaN 1.263717 B 0.061399 NaN C NaN 0.912017 In [86]: pd.pivot_table(df, values="E", index=["A", "B"], columns=["C"]) Out[86]: C bar foo A B one A -0.908049 0.088130 B -0.661966 1.091079 C -1.096862 0.875454 three A -0.425327 NaN B NaN 0.474478 C -0.928048 NaN two A NaN -0.663361 B -1.199722 NaN C NaN -0.270919 In [87]: pd.pivot_table(df, values=["E",'D'], index=["A", "B"], columns=["C"]) Out[87]: D E C bar foo bar foo A B one A -1.549085 0.117403 -0.908049 0.088130 B -0.343714 -0.231218 -0.661966 1.091079 C -0.890250 -1.735391 -1.096862 0.875454 three A -0.590521 NaN -0.425327 NaN B NaN -2.155207 NaN 0.474478 C -0.771405 NaN -0.928048 NaN two A NaN 1.263717 NaN -0.663361 B 0.061399 NaN -1.199722 NaN C NaN 0.912017 NaN -0.270919
Time series¶
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.