pandas(七)数据规整化:清理、转换、合并、重塑之合并数据集
pandas对象中的数据可以通过一些内置的方式进行合并:
pandas.merge 可根据一个或多个键将不同的DataFrame中的行连接起来。
pandas.concat可以沿着一条轴将多个对象堆叠到一起
实例的方法conbine_first 可以将重复的数据编接到一起,用一个对象中的值填充另一个对象的缺失值。
数据库风格的DataFrame合并
In [51]: df1 = DataFrame({'key':['b','b','a','c','a','a','b'],'data1':range(7)}) In [53]: df2 = DataFrame({'key':['a','b','d'],'data2':range(3)}) In [54]: df1 Out[54]: data1 key 0 0 b 1 1 b 2 2 a 3 3 c 4 4 a 5 5 a 6 6 b In [55]: df2 Out[55]: data2 key 0 0 a 1 1 b 2 2 d In [56]: pd.merge(df1,df2) Out[56]: data1 key data2 0 0 b 1 1 1 b 1 2 6 b 1 3 2 a 0 4 4 a 0 5 5 a 0
默认不显式指明用哪个键进行连接的时候,merge会将重叠列的列名当做键。不过,最好显式的指明:
In [57]: pd.merge(df1,df2,on = 'key')
如果两个列的列名不同,可以分别指定
In [58]: df4 = DataFrame({'key-r':['a','b','d'],'data2':range(3)}) In [59]: df3 = DataFrame({'key-l':['b','b','a','c','a','a','b'],'data1':range(7)}) In [63]: pd.merge(df3,df4,left_on = 'key-l',right_on = 'key-r') Out[63]: data1 key-l data2 key-r 0 0 b 1 b 1 1 b 1 b 2 6 b 1 b 3 2 a 0 a 4 4 a 0 a 5 5 a 0 a
默认情况下,merge做的是‘inner’连接,merge还有left、right和outer:
In [64]: pd.merge(df1,df2,how='outer') Out[64]: data1 key data2 0 0.0 b 1.0 1 1.0 b 1.0 2 6.0 b 1.0 3 2.0 a 0.0 4 4.0 a 0.0 5 5.0 a 0.0 6 3.0 c NaN 7 NaN d 2.0
如果要根据多个键进行合并,传入一个由列名组成的列表即可。
In [65]: left = DataFrame({'key1':['a','a','b'],'key2':['one','two','one'],'lrow':[1,2,3]}) In [66]: right = DataFrame({'key1':['a','a','b','b'],'key2':['one','one','one','two'],'rrow':[4,5,6,7]}) In [67]: pd.merge(left,right,on = ['key1','key2']) Out[67]: key1 key2 lrow rrow 0 a one 1 4 1 a one 1 5 2 b one 3 6
如果合并的列中存在重复的列,可以指定重复的列的后缀进行区分
In [68]: pd.merge(left,right,on = ['key1']) Out[68]: key1 key2_x lrow key2_y rrow 0 a one 1 one 4 1 a one 1 one 5 2 a two 2 one 4 3 a two 2 one 5 4 b one 3 one 6 5 b one 3 two 7 #指定列名 In [71]: pd.merge(left,right,on = ['key1'],suffixes=['_eft','_right']) Out[71]: key1 key2_eft lrow key2_right rrow 0 a one 1 one 4 1 a one 1 one 5 2 a two 2 one 4 3 a two 2 one 5 4 b one 3 one 6 5 b one 3 two 7
索引上的合并
有时候,DataFrame中的连接键位于其索引中,merge可以通过left_index = True 或 right_index =True 来讲索引应用于连接键。
In [73]: left = DataFrame({'key':['a','b','a','a','b','c'],'value':range(6)}) In [75]: right = DataFrame({'group_val':[3.5,7]},index=['a','b']) In [76]: pd.merge(left,right,left_on = 'key',right_index= True) Out[76]: key value group_val 0 a 0 3.5 2 a 2 3.5 3 a 3 3.5 1 b 1 7.0 4 b 4 7.0
对于层次化的索引,需要对应的left_on = ['key1','key2'] , right_index = True
pandas的数据合并
轴向连接
另一种合并运算也叫做连接、绑定或堆叠。
numpy有一个用于原始合并Numpy数组的concatenation函数
>>> arr = np.arange(12).reshape(3,4)
>>> arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> np.concatenate([arr,arr])
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> np.concatenate([arr,arr],axis =1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
pandas的concat函数提供了解决类似上述数组的链接操作。并且能够解决下述问题:
1、如果各个对象其他轴上的索引不同,合并的时候是采用交集还是并集。
2、结果对象中的分组需要各不相同吗?
3、用于连接的轴的重要性。
以Series为例
>>> s1 = Series(['0','1'],index = ['a','b'])
>>> s2 = Series([2,3,4],index = ['c','d','e'])
>>> s3 = Series({'f':5,'g':6})
>>> s1
a 0
b 1
dtype: object
>>> s3
f 5
g 6
dtype: int64
>>> pd.concat((s1,s2,s3))
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: object
>>>
默认情况下concat是在axis=0轴上工作的,生成一个新的Series。如果设定axis =1,就会生成一个DataFrame对象。
>>> pd.concat((s1,s2,s3),axis=1)
0 1 2
a 0 NaN NaN
b 1 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
上面的情况,都是另外一条轴上没有重叠
如果另外一条轴上有重叠的话,会发生什么呢?
>>> s4
a 00000
b 11111
f 5
g 6
dtype: object
>>> pd.concat((s1,s4),axis =1)
0 1
a 0 00000
b 1 11111
f NaN 5
g NaN 6
默认是取并集的,如果是要取交集,只需要传入join = ‘inner’
>>> pd.concat((s1,s4),axis =1,join='inner')
0 1
a 0 00000
b 1 11111
你可以通过join_axes指定要在其他轴上使用的索引。
>>> pd.concat((s1,s4),axis =1,join_axes = (['a','c','b','e'],))
0 1
a 0 00000
c NaN NaN
b 1 11111
e NaN NaN
在连接轴上创建一个层次化索引。用keys参数。
>>>
>>> result = pd.concat([s1,s1,s3],keys = ['one','two','three'])
>>> result
one a 0
b 1
two a 0
b 1
three f 5
g 6
dtype: object
如果定义axis的话,就会形成一个DataFrame对象,并且将keys作为列头。
对于DataFrame也一样,只不过会生成层次化索引,默认是行索引。
In [12]: df1 = DataFrame(np.arange(6).reshape(3,2),index = ['a','b','c'],columns = ['one','two'])
In [13]: df2 =DataFrame(np.arange(4).reshape(2,2),index= ['a','c'],columns = ['three','four'])
In [14]: df1
Out[14]:
one two
a 0 1
b 2 3
c 4 5
In [15]: df2
Out[15]:
three four
a 0 1
c 2 3
In [16]: pd.concat([df1,df2],keys = ['item1','item2'])
Out[16]:
four one three two
item1 a NaN 0.0 NaN 1.0
b NaN 2.0 NaN 3.0
c NaN 4.0 NaN 5.0
item2 a 1.0 NaN 0.0 NaN
c 3.0 NaN 2.0 NaN
In [17]: pd.concat([df1,df2],keys = ['item1','item2'],axis = 1)
Out[17]:
item1 item2
one two three four
a 0 1 0.0 1.0
b 2 3 NaN NaN
c 4 5 2.0 3.0
如果传入的不是一个列表,而是一个字典,则字典的键就会被当做keys选项的值
In [21]: pd.concat({'level1':df1,'level2':df2},axis =1)
Out[21]:
level1 level2
one two three four
a 0 1 0.0 1.0
b 2 3 NaN NaN
c 4 5 2.0 3.0
此外,还有两个管理层次化索引创建方式的参数,name和levels
连接轴上的索引与分析工作无关的话,可以用ignore_index忽略即可。
In [25]: df1 = DataFrame(np.random.randn(3,4),columns=['a','b','c','d'])
In [26]: df2 = DataFrame(np.random.randn(2,3),columns = ['b','d','a'])
In [27]: pd.concat([df1,df2])
Out[27]:
a b c d
0 -0.859642 -0.637505 -0.830966 0.378526
1 -0.614132 1.692308 0.600534 -0.879568
2 -0.004405 -1.271789 -0.825860 0.528685
0 -1.219479 -0.626997 NaN -2.449499
1 0.366940 1.175266 NaN 0.434918
In [28]: pd.concat([df1,df2],ignore_index = True)
Out[28]:
a b c d
0 -0.859642 -0.637505 -0.830966 0.378526
1 -0.614132 1.692308 0.600534 -0.879568
2 -0.004405 -1.271789 -0.825860 0.528685
3 -1.219479 -0.626997 NaN -2.449499
4 0.366940 1.175266 NaN 0.434918
合并重叠数据:
有索引或部分索引重叠的两个数据集。如果要合并这两个数据集,可用Series和DataFrame的combine_first方法
In [32]: a = Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index = ['f','e','d','c','b','a'])
In [33]: b =Series(np.arange((len(a)),dtype = np.float64),index = ['f','e','d','c','b','a'])
In [34]: b[:-2]
Out[34]:
f 0.0
e 1.0
d 2.0
c 3.0
dtype: float64
In [35]: a[2:]
Out[35]:
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
In [36]: b[:-2].combine_first(a[2:])
Out[36]:
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
In [37]: b
Out[37]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
dtype: float64
In [38]: b.combine_first(a)
Out[38]:
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a 5.0
dtype: float64
In [39]: a.combine_first(b)
Out[39]:
f 0.0
e 2.5
d 2.0
c 3.5
b 4.5
a 5.0
dtype: float64
combine_first方法可以给调用者的缺失值数据“打补丁”,即用被调用者的非na值代替调用者的缺失值。