Pandas数据规整学习笔记2

2 重塑和轴向旋转

2.1 重塑层次化索引

针对DataFrame,有stack和unstack。stack将数据的列“旋转”为行,unstack反之。

In [6]: data = DataFrame(np.arange(6).reshape(2,3),index=pd.Index(['Ohio','Color
   ...: ado'], name='state'),columns=pd.Index(['one','two','three'],name='number
   ...: '))

In [7]: data
Out[7]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [8]: data.stack()  # 把列转化为行,即形成一个层次化索引的Series
Out[8]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [9]: data.stack().unstack()  # 把层次化索引的Series转化为一个普通的DataFrame
Out[9]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5
In [10]: data.stack().unstack('number')
Out[10]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [11]: data.stack().unstack(1)
Out[11]:
number    one  two  three
state                    
Ohio        0    1      2
Colorado    3    4      5

In [12]: data.stack().unstack(0)
Out[12]:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

In [13]: data.stack().unstack('state')  # 选择指定的列,进行旋转
Out[13]:
state   Ohio  Colorado
number                
one        0         3
two        1         4
three      2         5

对于Series也是一样,

In [14]: s1 = Series([0,1,2,3],index=list('abcd'))

In [15]: s2 = Series([4,5,6],index=list('cde'))

In [16]: data2 = pd.concat([s1,s2],keys=['one','two'])

In [17]: data2
Out[17]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [18]: data2.unstack()
Out[18]:
       a    b    c    d    e
one  0.0  1.0  2.0  3.0  NaN
two  NaN  NaN  4.0  5.0  6.0

In [19]: data2.unstack().stack()  # 默认丢弃nan值
Out[19]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [20]: data2.unstack().stack(dropna=False) # 不丢弃nan值
Out[20]:
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

2.2 将“长格式”转化为“宽格式”

时间序列的通常以“长格式”或“堆栈”格式存放在数据库或者CSV中。

In [28]: df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
    ...: 'B' : ['A', 'B', 'C'] * 4,'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'ba
    ...: r'] * 2,
    ...: 'D' : np.random.randn(12),'E' : np.random.randn(12)})

In [29]: df
Out[29]:
        A  B    C         D         E
0     one  A  foo -1.130142 -0.800142
1     one  B  foo  1.453914 -1.344615
2     two  C  foo  0.452666  0.103939
3   three  A  bar  0.053393  0.284834
4     one  B  bar -0.435699  0.885828
5     one  C  bar  0.492411 -1.280743
6     two  A  foo -1.819792 -1.654636
7   three  B  foo  0.364982  1.654728
8     one  C  foo -1.205436  0.712150
9     one  A  bar  2.696329  1.357930
10    two  B  bar -0.561867  1.326672
11  three  C  bar  0.782410 -0.600689

# 这样就是df表中以D作为数值域,A,B为行,C为列的数据视图。

In [30]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[30]:
C             bar       foo
A     B                    
one   A  2.696329 -1.130142
      B -0.435699  1.453914
      C  0.492411 -1.205436
three A  0.053393       NaN
      B       NaN  0.364982
      C  0.782410       NaN
two   A       NaN -1.819792
      B -0.561867       NaN
      C       NaN  0.452666

3 数据转换

前面两部分说的都是数据的重排,还有一类操作就是过滤、清理和转换。

3.1 移除重复数据

duplicated方法返回一个bool型Series,判断行是否为重复行。
drop_duplicates方法:返回一个删除了重复行的DataFrame

In [31]: data = DataFrame({'k1':['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})

In [32]: data
Out[32]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

In [33]: data.duplicated()
Out[33]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [34]: data.duplicated('k2')  # 按照列进行判断
Out[34]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [35]: data.drop_duplicates
Out[35]:
<bound method DataFrame.drop_duplicates of     k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4>

In [36]: data.drop_duplicates()  # 返回一个新的去重后的对象
Out[36]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

In [37]: data  # 原对象不变
Out[37]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

# 默认保留的是第一个重复值
In [41]: data.drop_duplicates(['k1','k2'])  #
Out[41]:
    k1  k2  v1
0  one   1   0
2  one   2   2
3  two   3   3
5  two   4   5

In [42]: data.drop_duplicates(['k1','k2'],take_last=True)
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: the take_last=True keyword is deprecated, use keep='last' instead
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[42]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

# 也可以指定保留最后一个值
In [43]: data.drop_duplicates(['k1','k2'],keep='last')
Out[43]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

In [44]: data.drop_duplicates(['k1','k2'],keep='first')
Out[44]:
    k1  k2  v1
0  one   1   0
2  one   2   2
3  two   3   3
5  two   4   5

3.2 利用函数或映射进行数据转换

映射可以是一个字典,key为源,value为目标。
如何转换呢?DataFrame通过apply或者applymap,Series通过map函数进行。

In [45]: data = DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','cor
    ...: ned beef','Bacon','pastrami','honey ham','nova lox'],'ounces':[4,3,12,6
    ...: ,7.5,8,3,5,6]})

In [46]: data
Out[46]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

In [47]: meat_to_animal = {'bacon':'pig','pulled pork':'pig','pastrami':'cow','c
    ...: orned beef':'cow','honey ham':'pig','nova lox':'salmon'}

In [49]: data['food']
Out[49]:
0          bacon
1    pulled pork
2          bacon
3       Pastrami
4    corned beef
5          Bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

# 在data中food列,有些存在大写,所以需要处理一下。
In [50]: data['food'].map(str.lower).map(meat_to_animal)
Out[50]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object


# 通过字典进行映射
In [51]: data['animal']=data['food'].map(str.lower).map(meat_to_animal)

In [52]: data
Out[52]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon


#通过函数进行映射
In [53]: data['animal']=data['food'].map(lambda x:meat_to_animal[(x.lower())])

In [54]: data
Out[54]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

3.3 替换值

fillna是填充nan值,可以看成是替换值的一种特殊情况。
replace方法提供一个简单的替换值得方法。

In [55]: data = Series([1,-999,2,-999,-1000,3])

In [56]: data
Out[56]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

In [57]: data.replace(-999,np.nan)
Out[57]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [58]: data.replace([-999,-1000],np.nan) # 列表 + 标量
Out[58]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [59]: data.replace([-999,-1000],[np.nan,0])  # 两个列表
Out[59]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [60]: data.replace({-999:np.nan,-1000:0})  # 字典形式
Out[60]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

3.4 重命名轴索引

# 修改源数据
In [62]: data
Out[62]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: int64

In [63]: data.index = range(2,8)

In [64]: data
Out[64]:
2       1
3    -999
4       2
5    -999
6   -1000
7       3
dtype: int64

In [65]: data.index.map(lambda x:x+100)
Out[65]: array([102, 103, 104, 105, 106, 107])

In [67]: data.index = data.index.map(lambda x:x+100)

In [68]: data
Out[68]:
102       1
103    -999
104       2
105    -999
106   -1000
107       3
dtype: int64

# 创建一个新数据,使用rename
In [73]: data2 = DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'),index
    ...: =range(3))

In [74]: data2
Out[74]:
   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
In [77]: data2.rename(index={0:100})
Out[77]:
     a  b   c   d
100  0  1   2   3
1    4  5   6   7
2    8  9  10  11

In [78]: data2.rename(index={0:100},columns={'a':'ff'})
Out[78]:
     ff  b   c   d
100   0  1   2   3
1     4  5   6   7
2     8  9  10  11

3.5 离散化和面元划分

连续数据常常需要被拆分为离散数据。这就需要用cut函数。

In [79]: ages = range(100)

In [80]: bins = [18,25,35,60,100]

In [81]: pd.cut(ages,bins)
Out[81]:
[NaN, NaN, NaN, NaN, NaN, ..., (60, 100], (60, 100], (60, 100], (60, 100], (60, 100]]
Length: 100
Categories (4, object): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

In [82]: cuts = pd.cut(ages,bins)

In [83]: cuts[45]
Out[83]: '(35, 60]'

In [85]: cuts.dtype
Out[85]: category

这时,cut返回的是一个category对象,

In [86]: cuts.labels
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: 'labels' is deprecated. Use 'codes' instead
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[86]:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,
        3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3,  3], dtype=int8)

In [87]: cuts.levels
/Users/yangfeilong/anaconda/bin/ipython:1: FutureWarning: Accessing 'levels' is deprecated, use 'categories'
  #!/bin/bash /Users/yangfeilong/anaconda/bin/python.app
Out[87]: Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')

In [88]: pd.value_counts(cuts) #查看每个种类的数量
Out[88]:
(60, 100]    39
(35, 60]     25
(25, 35]     10
(18, 25]      7
dtype: int64

# 使用right=False可以修改开端和闭端

In [89]: cuts = pd.cut(ages,bins,right=False)

In [90]: cuts
Out[90]:
[NaN, NaN, NaN, NaN, NaN, ..., [60, 100), [60, 100), [60, 100), [60, 100), [60, 100)]
Length: 100
Categories (4, object): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

In [91]: cuts = pd.cut(ages,bins,right=False,labels=list('abcd'))

In [92]: cuts
Out[92]:
[NaN, NaN, NaN, NaN, NaN, ..., d, d, d, d, d]
Length: 100
Categories (4, object): [a < b < c < d]

# 如果cut传入的是数字n,那么就会均分成n份。
In [94]: pd.cut(ages,5)
Out[94]:
[(-0.099, 19.8], (-0.099, 19.8], (-0.099, 19.8], (-0.099, 19.8], (-0.099, 19.8], ..., (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99]]
Length: 100
Categories (5, object): [(-0.099, 19.8] < (19.8, 39.6] < (39.6, 59.4] < (59.4, 79.2] < (79.2, 99]]

# qcut使用样本分位数进行划分,比cut更平均,获得大小基本相同的面元
In [95]: pd.qcut(ages,5)
Out[95]:
[[0, 19.8], [0, 19.8], [0, 19.8], [0, 19.8], [0, 19.8], ..., (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99], (79.2, 99]]
Length: 100
Categories (5, object): [[0, 19.8] < (19.8, 39.6] < (39.6, 59.4] < (59.4, 79.2] < (79.2, 99]]
In [96]: pd.value_counts(pd.qcut(ages,5))
Out[96]:
(79.2, 99]      20
(59.4, 79.2]    20
(39.6, 59.4]    20
(19.8, 39.6]    20
[0, 19.8]       20
dtype: int64

In [98]: ages = np.random.randn(1000)

In [99]: pd.value_counts(pd.qcut(ages,5))
Out[99]:
(0.883, 3.534]      200
(0.25, 0.883]       200
(-0.266, 0.25]      200
(-0.806, -0.266]    200
[-3.172, -0.806]    200
dtype: int64

In [100]: pd.value_counts(pd.cut(ages,5))
Out[100]:
(-0.489, 0.852]     481
(-1.831, -0.489]    283
(0.852, 2.193]      198
(-3.179, -1.831]     23
(2.193, 3.534]       15
dtype: int64


3.6 检测和过滤异常值

只看个例子:

In [101]: data = DataFrame(np.random.randn(1000,4))

In [102]: data
Out[102]:
            0         1         2         3
0    0.746898 -0.657217  0.358517  1.075503
1   -0.821582  1.408816  0.937318 -0.548889
2   -0.527847  0.749256 -0.524283 -0.977338
3    0.800118 -0.544609  0.481839  0.555546
...
996 -0.081920  0.100507 -0.431551 -1.832261
997 -1.069475  0.205978  0.918592  0.881498
998  1.749007  2.658384 -1.079657 -0.023890
999  1.488167 -0.061146  0.006189  0.070564

[1000 rows x 4 columns]

In [103]: data.describe()
Out[103]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.018474     0.030781     0.025966     0.041950
std       1.020634     0.982547     1.010164     0.948116
min      -3.150278    -3.422772    -3.406069    -2.575391
25%      -0.719646    -0.621461    -0.644949    -0.572558
50%      -0.021287     0.043128     0.027890     0.050492
75%       0.701195     0.711455     0.678644     0.665413
max       3.270054     2.983500     3.323021     2.925822


In [105]: data[(np.abs(data) > 3).any(1)]
Out[105]:
            0         1         2         3
16   1.450842  2.222636 -3.289273  1.265199
353 -3.067571  0.490152  0.536051  1.252143
531  1.518928  0.289636  3.323021 -0.695272
688  1.776682 -0.799736 -3.406069  0.632751
702  0.807953 -0.674675 -3.045511 -0.267613
704 -0.469205 -3.422772 -2.669146 -0.712407
772 -0.861015 -0.573432  3.012727  0.325476
808  3.270054 -0.035521  0.061092  1.263363
812 -3.150278 -0.734308  1.376565  0.671424

In [108]: data[np.abs(data) > 3] = 100

In [110]: data.describe()
Out[110]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean      0.284474     0.134204     0.529371     0.041950
std       5.564261     3.308565     7.123116     0.948116
min      -2.746446    -2.983108    -2.710984    -2.575391
25%      -0.714890    -0.619551    -0.635457    -0.572558
50%      -0.018859     0.048775     0.036321     0.050492
75%       0.710050     0.712375     0.686314     0.665413
max     100.000000   100.000000   100.000000     2.925822

In [111]: data[np.abs(data) > 3] = np.sign(data) * 3

In [113]: data.describe()
Out[113]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.006526     0.037204     0.044371     0.041950
std       1.019295     0.980940     1.006138     0.948116
min      -2.746446    -2.983108    -2.710984    -2.575391
25%      -0.714890    -0.619551    -0.635457    -0.572558
50%      -0.018859     0.048775     0.036321     0.050492
75%       0.710050     0.712375     0.686314     0.665413
max       3.000000     3.000000     3.000000     2.925822

3.7 排列和随机采样

# 随机重排序
In [114]: df = DataFrame(np.arange(20).reshape(5,4))

In [115]: df.ix[np.random.permutation(5)]
Out[115]:
    0   1   2   3
0   0   1   2   3
4  16  17  18  19
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

# 获取随机前三行
In [116]: df.ix[np.random.permutation(len(df))[:3]]
Out[116]:
    0   1   2   3
0   0   1   2   3
2   8   9  10  11
4  16  17  18  19

# 随机获取10,每个值随机取自bag
In [118]: bag = np.array([5,7,-1,4,6])

In [119]: bag.take(np.random.randint(0,len(bag),size=10))
Out[119]: array([ 7,  7,  7,  5,  7,  6,  4, -1,  6,  4])

3.8 计算标量/哑变量

还有一种常用的转换方式是:将分类变量转换为哑变量矩阵或指标矩阵。

# 比较复杂,其实就是用1和0填充一个矩阵,说明元素在一个Series中的位置。
In [120]: s1 = Series(list('bbacab'))

In [121]: pd.get_dummies(s1)
Out[121]:
     a    b    c
0  0.0  1.0  0.0
1  0.0  1.0  0.0
2  1.0  0.0  0.0
3  0.0  0.0  1.0
4  1.0  0.0  0.0
5  0.0  1.0  0.0

结合cut和get_dummies函数可以来做一些统计有用的事情:

In [122]: values = np.random.rand(10)

In [123]: values
Out[123]:
array([ 0.96294606,  0.15142481,  0.38260636,  0.01174038,  0.44699216,
        0.3726047 ,  0.50965999,  0.61532174,  0.27740518,  0.28994142])

In [126]: bins = [0,0.2,0.4,0.6,0.8,1.]

In [127]: pd.cut(values,bins)
Out[127]:
[(0.8, 1], (0, 0.2], (0.2, 0.4], (0, 0.2], (0.4, 0.6], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.2, 0.4], (0.2, 0.4]]
Categories (5, object): [(0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]

In [128]: pd.get_dummies(pd.cut(values, bins)) # 查看每个类型的指标矩阵
Out[128]:
   (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0       0.0         0.0         0.0         0.0       1.0
1       1.0         0.0         0.0         0.0       0.0
2       0.0         1.0         0.0         0.0       0.0
3       1.0         0.0         0.0         0.0       0.0
4       0.0         0.0         1.0         0.0       0.0
5       0.0         1.0         0.0         0.0       0.0
6       0.0         0.0         1.0         0.0       0.0
7       0.0         0.0         0.0         1.0       0.0
8       0.0         1.0         0.0         0.0       0.0
9       0.0         1.0         0.0         0.0       0.0

In [129]: a =pd.get_dummies(pd.cut(values, bins))

In [131]: a.sum(axis=0)  # 统计一下每个类型的数量
Out[131]:
(0, 0.2]      2.0
(0.2, 0.4]    4.0
(0.4, 0.6]    2.0
(0.6, 0.8]    1.0
(0.8, 1]      1.0
dtype: float64

待续。。。

posted @ 2017-02-06 14:23  felo  阅读(1791)  评论(0编辑  收藏  举报