
pandas Study Notes 2: Basic Functionality

1. Reindexing

import numpy as np
from pandas import Series, DataFrame

obj = Series([1,2,3,4],index=['a','b','c','d'])
Output:
a    1
b    2
c    3
d    4
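Series has a reindex method that returns a new object conformed to the given index, so the elements can be reordered; the original Series is left unchanged.
obj.reindex(['a','c','d','b','e'],fill_value = 0)  # fill_value supplies the value for labels missing from the original index
Output: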
a    1
c    3
d    4
b    2
e    0
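As a small sketch of the point above (the names reordered and with_nan are illustrative and not from the original notes): reindex returns a new Series, and a label that is absent from the original index becomes NaN unless fill_value is given.

```python
import numpy as np
from pandas import Series

obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# reindex returns a new Series; obj itself keeps its original order.
reordered = obj.reindex(['a', 'c', 'd', 'b', 'e'], fill_value=0)

# Without fill_value, the missing label 'e' is filled with NaN,
# which forces the result to a floating-point dtype.
with_nan = obj.reindex(['a', 'c', 'd', 'b', 'e'])

print(list(obj.index))
print(reordered['e'])
print(with_nan.dtype)
```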
obj2 = Series(['red','blue'],index=[0,4])
Output:
0     red
4    blue

obj2.reindex(range(6),method='ffill')   # method='ffill' fills forward: each missing label takes the last preceding value
Output:
0     red
1     red
2     red
3     red
4    blue
5    blue
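For comparison, here is a hedged sketch of forward filling next to backward filling (method='bfill'), which takes the next following value instead. The names forward and backward are illustrative only. Both methods assume the original index is sorted.

```python
from pandas import Series, isna

obj2 = Series(['red', 'blue'], index=[0, 4])

# Forward fill: a missing label copies the last value before it.
forward = obj2.reindex(range(6), method='ffill')

# Backward fill: a missing label copies the next value after it.
# Label 5 has no following value, so it stays missing.
backward = obj2.reindex(range(6), method='bfill')

print(forward.tolist())
print(backward.tolist())
```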
For a DataFrame, reindex can change the rows (the index), the columns, or both.
frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
Output:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

frame2 = frame.reindex(['a','b','c','d'])   # a single list of labels reindexes the rows; the new row 'b' is all NaN, so the values become floats
Output:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

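states = ['Texas','Utah','California']  # column labels, as shown in the output below
frame4 = frame.reindex(columns=states)  # the columns keyword reindexes the columns
Output: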
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

frame5 = frame.reindex(index = ['a','b','c','d'],columns=states)  # reindex the rows and the columns at the same time
Output:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
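The DataFrame cases above can be sketched in one self-contained snippet. The list states is taken from the column labels in the outputs; the names by_keyword, by_axis and both are illustrative. The axis='columns' spelling is an equivalent alternative to the columns keyword.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Two equivalent ways to reindex the columns.
by_keyword = frame.reindex(columns=states)
by_axis = frame.reindex(states, axis='columns')

# Reindex rows and columns in a single call.
both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

print(by_keyword)
print(both)
```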

2. Dropping Entries from an Axis

obj = Series(np.arange(3.),index = ['a','b','c'])
Output:
a    0.0
b    1.0
c    2.0

obj.drop(['a','b'])
Output:
c    2.0

frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
Output:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

frame.drop(['a'])  # drop a row; a new object is returned and frame itself is unchanged
Output:
   Ohio  Texas  California
c     3      4           5
d     6      7           8

frame.drop(['Ohio'],axis = 1)  # drop a column
Output:
   Texas  California
a      1           2
c      4           5
d      7           8
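A short sketch of the drop calls above, assuming the same frame. It also shows the columns keyword, which is an alternative spelling of axis=1; the names no_row, no_col and same are illustrative.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

# drop returns a new object; frame is not modified.
no_row = frame.drop(['a'])

# Two equivalent ways to drop a column.
no_col = frame.drop(columns=['Ohio'])
same = frame.drop(['Ohio'], axis=1)

print(no_row)
print(no_col)
print(frame.shape)
```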

3. Indexing, Selection, and Filtering

Series indexing works like NumPy array indexing, except that the index labels need not be integers.

obj = Series([1,2,3,4],index=['a','b','c','d'])   
>>>
a    1
b    2
c    3
d    4

obj['b']
>>> 2

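obj.iloc[1]   # select by position; plain obj[1] on a non-integer index is deprecated in recent pandas versions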
>>> 2

obj[0:3]
>>>
a    1
b    2
c    3

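obj.iloc[[0,3]]   # a list of integer positions; obj[[0,3]] on a label index is deprecated in recent pandas versions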
>>>
a    1
d    4

obj[obj<2]
>>>
a    1

Slicing with labels differs from ordinary Python slicing: the endpoint is included, i.e. the interval is closed:

obj['b':'d']
>>>
b    2
c    3
d    4
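To make the contrast explicit, here is a minimal sketch using the loc and iloc accessors (the names by_label and by_position are illustrative): label slices include the endpoint, positional slices do not.

```python
from pandas import Series

obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Label-based slice: both 'b' and 'd' are included.
by_label = obj.loc['b':'d']

# Position-based slice: position 3 is excluded, as in plain Python.
by_position = obj.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```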

DataFrame indexing: indexing a DataFrame with a column name or a list of column names retrieves one or more columns:

frame = DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns = ['one','two','three','four'])
>>>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

frame['two']
>>>
Ohio         1
Colorado     5
Utah         9
New York    13

frame[:2]     # slicing selects rows, not columns
>>>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
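Because plain [] means columns for a label and rows for a slice, explicit selection is often clearer. The following is a sketch, not part of the original notes, using loc (labels), iloc (positions) and a boolean filter on the same frame; the result names are illustrative.

```python
import numpy as np
from pandas import DataFrame

frame = DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

# A list of labels selects several columns, in the order given.
cols = frame[['three', 'one']]

# loc selects by row label and column labels.
row = frame.loc['Colorado', ['two', 'three']]

# iloc selects by integer position.
by_pos = frame.iloc[2, [3, 0, 1]]

# A boolean Series keeps only the rows where the condition holds.
filtered = frame[frame['three'] > 5]

print(row.tolist())
print(by_pos.tolist())
print(list(filtered.index))
```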

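4. Arithmetic and Data Alignment

One of the most important features of pandas is arithmetic between objects with different indexes. When objects are added, the index of the result is the union of the two indexes, and labels present in only one operand get NaN.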

s1 = Series([1,2,3],['a','b','c'])             
s2 = Series([4,5,6],['b','c','d'])             
s1 + s2                               
>>>
a    NaN
b    6.0
c    8.0
d    NaN

For a DataFrame, alignment happens on both the rows and the columns:

df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))   
df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))  

df1
>>>
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

df2
>>>
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df1 + df2
>>>
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
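If the NaN values from non-overlapping labels are unwanted, the add method accepts a fill_value argument. The sketch below reuses df1 and df2; the names plain and filled are illustrative. With fill_value=0, a cell missing from one operand is treated as 0 before the addition.

```python
import numpy as np
from pandas import DataFrame

df1 = DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

# Plain addition: labels present in only one operand give NaN.
plain = df1 + df2

# add with fill_value: the missing side is replaced by 0 first,
# so row 3 and column 'e' carry the values from df2.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```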
Now consider arithmetic between a DataFrame and a Series:
arr = DataFrame(np.arange(12.).reshape((3,4)),columns = list('abcd'))

arr
>>>
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

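series = arr.iloc[0]  # arr[0] would raise KeyError, because [] on a DataFrame selects columns; iloc selects a row by position (the old ix indexer was removed in pandas 1.0)
>>>
a    0.0
b    1.0
c    2.0
d    3.0

arr - series  # by default, the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows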
>>>
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  4.0  4.0  4.0  4.0
2  8.0  8.0  8.0  8.0

series2 = Series(range(3),index = list('cdf'))
>>>
c    0
d    1
f    2

arr + series2   # labels found in only one operand ('a', 'b', 'f') produce all-NaN columns
>>>
    a   b     c     d   f
0 NaN NaN   2.0   4.0 NaN
1 NaN NaN   6.0   8.0 NaN
2 NaN NaN  10.0  12.0 NaN

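series3 = arr['d']
>>>
0     3.0
1     7.0
2    11.0

# To match on the rows and broadcast across the columns instead, use one of the arithmetic methods.
# The axis argument names the axis to match on.
# axis = 0 (or 'index') matches the Series index against the DataFrame's row index.
arr.sub(series3,axis = 0)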
>>>
     a    b    c    d
0 -3.0 -2.0 -1.0  0.0
1 -3.0 -2.0 -1.0  0.0
2 -3.0 -2.0 -1.0  0.0
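The two broadcasting directions can be put side by side in one sketch (the names by_cols and by_rows are illustrative): subtracting a row matches on the column labels, while sub with axis=0 matches on the row labels.

```python
import numpy as np
from pandas import DataFrame

arr = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

# Default: the Series index (a, b, c, d) is matched against the columns,
# and the first row is subtracted from every row.
by_cols = arr - arr.iloc[0]

# axis=0: the Series index (0, 1, 2) is matched against the row index,
# and column 'd' is subtracted from every column.
by_rows = arr.sub(arr['d'], axis=0)

print(by_cols)
print(by_rows)
```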

5. Function Application and Mapping

posted @ 2017-09-26 23:27  韦木三