A Brief Introduction to pandas for Data Processing

Official website: http://pandas.pydata.org/

 

I. Two basic data structures: Series and DataFrame

Let's look at Series first.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Series
s = pd.Series([i * 2 for i in range(1, 11)])
print(s)

Output:

0     2
1     4
2     6
3     8
4    10
5    12
6    14
7    16
8    18
9    20
dtype: int64

Here the 0 to 9 on the left are the index labels, and the 2, 4, 6, ... on the right are the values from the list we passed in.
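The split between index and values can be inspected directly. The sketch below reuses the same data; the string labels 'a', 'b', 'c' are made up purely to show that the index does not have to be 0..n-1.

```python
import pandas as pd

# Same data as above: the even numbers 2..20.
s = pd.Series([i * 2 for i in range(1, 11)])

# The index (left column) and the values (right column) are stored separately.
index_labels = list(s.index)      # default labels 0..9
values = s.values.tolist()        # the numbers we passed in

# Hypothetical custom labels, only for illustration.
labeled = pd.Series([2, 4, 6], index=['a', 'b', 'c'])

print(index_labels)
print(values)
print(labeled['b'])   # look a value up by its label
```

With a custom index, values are looked up by label rather than by position.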

 

Next, DataFrame.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
print(df)

Output:

                   A         B         C         D         E
2017-03-01  1.446957 -0.969023 -0.272529  0.695884 -0.842616
2017-03-02 -0.193140  1.231356  0.761668 -0.859277 -1.002324
2017-03-03 -0.441364  1.059026  0.392266  1.180888  0.144625
2017-03-04  0.510129  0.851746  0.110843  0.745591 -0.724988
2017-03-05  0.417613 -0.640111 -1.048320  1.605048  0.935129
2017-03-06  0.805600  0.491515  0.042078  0.081229 -0.293101
2017-03-07 -1.597687  0.268910  1.078853 -1.488760 -1.881305
2017-03-08 -2.414063  1.147526  0.143332  0.622884  1.760944

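The first argument, np.random.randn(8, 5), returns an array with 8 rows and 5 columns whose elements are random draws from the standard normal distribution.

The second argument, index=dates (dates is a DatetimeIndex), supplies the row index of the DataFrame.

The third argument, columns=list('ABCDE'), supplies the label of each column.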
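To see what each of the three arguments contributes, the pieces can be built and inspected one at a time. This is only a sketch; the fixed seed is an assumption added here so that repeated runs produce the same numbers.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # assumption: fixed seed, only for reproducible output

dates = pd.date_range('20170301', periods=8)   # 8 consecutive days
data = np.random.randn(8, 5)                   # 8 x 5 array of normal draws
df = pd.DataFrame(data, index=dates, columns=list('ABCDE'))

print(type(dates).__name__)   # the index type produced by date_range
print(data.shape)             # rows x columns of the raw array
print(df.shape)               # the DataFrame keeps the same shape
print(list(df.columns))       # column labels taken from 'ABCDE'
print(df.index[0])            # first row label
```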

 

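DataFrame also accepts a dictionary: each key becomes a column label and each value supplies that column's elements. The number of rows is set by the list-like values, which must all have the same length; a scalar value such as 1 is repeated for every row.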

df = pd.DataFrame({'A': 1, 'B': [1, 2, 3, 4]})
print(df)



   A  B
0  1  1
1  1  2
2  1  3
3  1  4

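As you can see, a Series is a building block of a DataFrame (loosely speaking, it behaves like a single-column DataFrame, although it is a separate class), and a DataFrame is in turn a collection of Series that share one index.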
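The relationship between the two structures can be checked with a few lines. This is a minimal sketch using the same dictionary as above; the names col_b, rebuilt and mismatch_rejected are introduced here only for illustration.

```python
import pandas as pd

# A scalar (1) is repeated to match the length of the list-valued column.
df = pd.DataFrame({'A': 1, 'B': [1, 2, 3, 4]})

# Selecting one column gives a Series that shares the DataFrame's index.
col_b = df['B']
print(type(col_b).__name__)
print(col_b.index.equals(df.index))

# Going the other way: a DataFrame assembled from Series.
rebuilt = pd.DataFrame({'A': df['A'], 'B': df['B']})
print(rebuilt.shape)

# Lists of different lengths are rejected rather than padded.
try:
    pd.DataFrame({'A': [1, 2], 'B': [1, 2, 3, 4]})
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print(mismatch_rejected)
```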

 

II. Basic DataFrame operations

df.head(n=5) returns the first n rows of df; n defaults to 5.

df.tail(n=5) returns the last n rows of df; n defaults to 5.

df.index returns the index of df.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
# print(df)

# Basic
print('Head')
print(df.head(3))
print('Tail')
print(df.tail(3))
print()
print(df.index)

Output:
Head
                   A         B         C         D         E
2017-03-01  0.872154  0.887637  0.877745  0.170153 -0.595866
2017-03-02 -2.260319 -1.400152 -0.347347 -0.880254 -0.388510
2017-03-03 -0.032758  0.393881 -0.279599  1.904316 -1.292630
Tail
                   A         B         C         D         E
2017-03-06 -0.116548 -0.459674  0.671389 -0.536236  1.224103
2017-03-07 -0.067690  0.678551 -0.258071 -0.352931  0.415018
2017-03-08  0.006201  0.464584  0.141018 -0.076282 -0.638886

DatetimeIndex(['2017-03-01', '2017-03-02', '2017-03-03', '2017-03-04',
               '2017-03-05', '2017-03-06', '2017-03-07', '2017-03-08'],
              dtype='datetime64[ns]', freq='D')

 

 

df.values returns the element values of df as a numpy.ndarray.

df.T returns the transposed DataFrame (rows and columns swapped).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

print(df.values)
print(df.T)

Output:
[[ 0.08981458  2.35966602  0.00606022  0.08633954  1.05939747]
 [-1.05151225 -1.19768201  1.83672123  1.20769635 -0.30581458]
 [-0.17192213 -0.75261065  1.04369857 -0.14874237  2.07925093]
 [-0.94600881  0.68897204 -0.18006348 -1.39294212 -0.24695665]
 [ 0.7730522  -1.62446734 -1.35308009  2.97657871  0.56537233]
 [ 0.24186251  0.56652445 -0.00513021  0.14593751  0.07460181]
 [-1.52712564  0.79666412 -1.68573768  0.85084609  0.48469802]
 [ 1.49180784 -0.04688902 -0.89278834 -0.81667428 -0.15639693]]
   2017-03-01  2017-03-02  2017-03-03  2017-03-04  2017-03-05  2017-03-06  \
A    0.089815   -1.051512   -0.171922   -0.946009    0.773052    0.241863
B    2.359666   -1.197682   -0.752611    0.688972   -1.624467    0.566524
C    0.006060    1.836721    1.043699   -0.180063   -1.353080   -0.005130
D    0.086340    1.207696   -0.148742   -1.392942    2.976579    0.145938
E    1.059397   -0.305815    2.079251   -0.246957    0.565372    0.074602

   2017-03-07  2017-03-08
A   -1.527126    1.491808
B    0.796664   -0.046889
C   -1.685738   -0.892788
D    0.850846   -0.816674
E    0.484698   -0.156397
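Because the random output above is hard to read at a glance, here is a smaller sketch with made-up integer data showing what df.values and df.T do to the shape and the labels.

```python
import numpy as np
import pandas as pd

# Made-up 2 x 3 frame: row 0 is [0, 1, 2], row 1 is [3, 4, 5].
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=list('ABC'))

arr = df.values
print(type(arr).__name__)                   # a plain NumPy array
print(df.shape, df.T.shape)                 # rows and columns trade places
print(list(df.T.index))                     # old column labels become row labels
print(np.array_equal(df.T.values, arr.T))   # same as transposing the array
```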

 

 

df.sort_values(['column_name'], ascending=True/False) sorts the rows by the values in the given column(s), in ascending (True or 1) or descending (False or 0) order.

df.sort_index(axis=0/1, ascending=True/False) sorts by the row index (axis=0) or by the column labels (axis=1), in ascending or descending order.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

print(df.sort_values('A', ascending=True))    # rows ordered by column A
print(df.sort_index(axis=0, ascending=False)) # newest date first
print(df.sort_index(axis=1, ascending=False)) # columns E, D, C, B, A

Output:
                   A         B         C         D         E
2017-03-04 -0.885580  0.668956  0.007392  0.561356 -0.214626
2017-03-05 -0.135071 -1.049060 -1.305366  0.558175 -0.087092
2017-03-08  0.188582  0.492789 -1.364214  0.504932  1.241542
2017-03-02  0.306423 -1.595937  0.532442  0.773825  0.196982
2017-03-06  0.901500 -0.115927 -1.448039  1.733633 -0.805994
2017-03-01  0.951188 -2.335634  1.592160  0.166211 -0.716212
2017-03-03  1.654593  0.431696  0.084542  0.121351 -0.197380
2017-03-07  1.673910  0.799920 -0.010755 -0.959697 -0.498297
                   A         B         C         D         E
2017-03-08  0.188582  0.492789 -1.364214  0.504932  1.241542
2017-03-07  1.673910  0.799920 -0.010755 -0.959697 -0.498297
2017-03-06  0.901500 -0.115927 -1.448039  1.733633 -0.805994
2017-03-05 -0.135071 -1.049060 -1.305366  0.558175 -0.087092
2017-03-04 -0.885580  0.668956  0.007392  0.561356 -0.214626
2017-03-03  1.654593  0.431696  0.084542  0.121351 -0.197380
2017-03-02  0.306423 -1.595937  0.532442  0.773825  0.196982
2017-03-01  0.951188 -2.335634  1.592160  0.166211 -0.716212
                   E         D         C         B         A
2017-03-01 -0.716212  0.166211  1.592160 -2.335634  0.951188
2017-03-02  0.196982  0.773825  0.532442 -1.595937  0.306423
2017-03-03 -0.197380  0.121351  0.084542  0.431696  1.654593
2017-03-04 -0.214626  0.561356  0.007392  0.668956 -0.885580
2017-03-05 -0.087092  0.558175 -1.305366 -1.049060 -0.135071
2017-03-06 -0.805994  1.733633 -1.448039 -0.115927  0.901500
2017-03-07 -0.498297 -0.959697 -0.010755  0.799920  1.673910
2017-03-08  1.241542  0.504932 -1.364214  0.492789  0.188582

df.describe() gives summary statistics for the data in the DataFrame.

print(df.describe())

              A         B         C         D         E
count  8.000000  8.000000  8.000000  8.000000  8.000000
mean  -0.087813  0.718481 -0.267764 -0.451818  0.560079
std    1.494381  0.499839  1.070840  1.008569  1.263091
min   -2.176000  0.045393 -1.701633 -1.838860 -2.090028
25%   -1.405379  0.298836 -0.848086 -0.997443  0.119207
50%    0.107146  0.836200 -0.246589 -0.536753  0.866753
75%    1.132235  1.040045  0.179091 -0.217005  1.542020
max    1.928376  1.310759  1.741579  1.624236  1.691431

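For each column it reports the count, mean, standard deviation (std), minimum, 25th percentile (lower quartile), median (50%), 75th percentile (upper quartile), and maximum.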
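Each row of the describe() table corresponds to an ordinary Series method. The sketch below compares them on made-up values (1 to 5) chosen so the statistics are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Made-up values, not the random data used above.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0, 5.0]})
summary = df.describe()['A']
col = df['A']

print(summary['count'], col.count())        # number of non-missing values
print(summary['mean'], col.mean())          # arithmetic mean
print(summary['std'], col.std())            # sample standard deviation (ddof=1)
print(summary['25%'], col.quantile(0.25))   # lower quartile
print(summary['50%'], col.median())         # median
print(summary['75%'], col.quantile(0.75))   # upper quartile
print(summary['min'], col.min(), summary['max'], col.max())
```

Note that std is the sample standard deviation (dividing by n - 1), which differs from NumPy's default of dividing by n.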

 

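df.mean() computes the mean of each column. The columns must be numeric: older pandas versions silently skipped non-numeric columns (as in the output below, where column A is missing from the result), while recent versions need df.mean(numeric_only=True) to do so.

The result is a Series. It can be converted with list(), and the elements of the resulting list are numpy.float64 values.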

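dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
df['A'] = 'a'  # replace column A with strings, so it is no longer numeric
print(df)
# numeric_only=True skips the string column A; recent pandas versions
# raise a TypeError here without it
means = df.mean(numeric_only=True)
print(means)
print(list(means))
print(type(list(means)[0]))

Output: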
            A         B         C         D         E
2017-03-01  a -0.857906  0.380529  0.531562  0.299163
2017-03-02  a  0.391248 -2.227574  0.792068 -1.100136
2017-03-03  a  0.260002  0.294271  0.392461  0.161064
2017-03-04  a -0.136737  0.018517 -0.284478  0.009943
2017-03-05  a -0.725036 -0.031868  1.289505 -0.108265
2017-03-06  a  1.616869 -1.528318  0.311700  1.386990
2017-03-07  a -0.961123 -0.244735 -0.120312 -0.595079
2017-03-08  a -0.631889  0.205291 -0.407998 -0.388415
B   -0.130572
C   -0.391736
D    0.313063
E   -0.041842
dtype: float64
[-0.13057154848170802, -0.39173579343646675, 0.31306349321693849, -0.041841901046697855]
<type 'numpy.float64'>

 

 

III. Selecting data from a DataFrame (slicing)

print(df['A'])
print(type(df['A']))


2017-03-01   -0.158978
2017-03-02   -1.690027
2017-03-03    1.188897
2017-03-04   -0.913982
2017-03-05    0.433453
2017-03-06   -1.381605
2017-03-07    0.148752
2017-03-08    1.021067
Freq: D, Name: A, dtype: float64
<class 'pandas.core.series.Series'>

Indexing df with a column name, as in df['A'], returns a Series holding the index and that column's values.

 

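To get one row (or several rows) as a DataFrame, you cannot index df directly with a single row label or position:

print(df['2017-03-01'])   # KeyError: there is no column with this name
print(df[0])              # KeyError: there is no column named 0

Both of these raise an error, because df[...] with a single key looks for a column.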

 

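Instead, slice as you would a list. You can slice by row position (0, 1, 2, 3, ...) or by index label (2017-03-01, 2017-03-02, ...). A position slice excludes its end point, while a label slice includes it.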

print(df[0:1])
print(df[1:])
print(df['2017-03-01':'2017-03-03'])
print(df[:])



                   A         B         C         D        E
2017-03-01 -0.744223  0.517575  0.199179 -0.531218  1.18652
                   A         B         C         D         E
2017-03-02  0.297713 -1.394280  0.722143  0.194107  0.020040
2017-03-03  1.040041  0.844153 -1.523378 -0.024551  2.524847
2017-03-04 -0.136714  0.581337  0.458747 -1.616134 -0.831049
2017-03-05  1.131013  1.268097  0.392704 -0.891760  0.056044
2017-03-06 -0.479798 -0.408351 -1.041832  0.052908 -1.037984
2017-03-07  0.886389  1.528950  1.044967  1.646536 -0.394471
2017-03-08 -0.712788  0.571170 -0.916402  0.843917  1.471186
                   A         B         C         D         E
2017-03-01 -0.744223  0.517575  0.199179 -0.531218  1.186520
2017-03-02  0.297713 -1.394280  0.722143  0.194107  0.020040
2017-03-03  1.040041  0.844153 -1.523378 -0.024551  2.524847
                   A         B         C         D         E
2017-03-01 -0.744223  0.517575  0.199179 -0.531218  1.186520
2017-03-02  0.297713 -1.394280  0.722143  0.194107  0.020040
2017-03-03  1.040041  0.844153 -1.523378 -0.024551  2.524847
2017-03-04 -0.136714  0.581337  0.458747 -1.616134 -0.831049
2017-03-05  1.131013  1.268097  0.392704 -0.891760  0.056044
2017-03-06 -0.479798 -0.408351 -1.041832  0.052908 -1.037984
2017-03-07  0.886389  1.528950  1.044967  1.646536 -0.394471
2017-03-08 -0.712788  0.571170 -0.916402  0.843917  1.471186

 

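You can also use loc to select single values, Series, or DataFrame objects.

To select one row (say row 0, whose index label is 2017-03-01), use df.loc['2017-03-01'] or df.loc[df.index[0]]; either returns a Series. df.loc[0] does not work here, because 0 is not a label in this index.

To select one column (say the column labeled 'A'), use df.loc[:, 'A'], which also returns a Series.

In other words, loc selects by label only; you cannot simply substitute the positions 0, 1, 2, ... for the labels.

To get a DataFrame covering several rows and columns, use slices.

Unlike df[ ] above, loc accepts only label slices, and both end points are included.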

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

print(df.loc['2017-03-01'])           # one row, as a Series
print()
print(df.loc[:, 'A'])                 # one column, as a Series
print()
print(df.loc['2017-03-01', 'A'])      # one element
print()
print(df.loc[df.index[0]:df.index[3], 'A':'C'])   # rows 0..3 and columns A..C

Output:
A    0.523693
B    0.949603
C   -0.683277
D    0.570584
E   -0.762546
Name: 2017-03-01 00:00:00, dtype: float64

2017-03-01    0.523693
2017-03-02   -1.327872
2017-03-03   -0.426860
2017-03-04    1.924556
2017-03-05   -0.107997
2017-03-06   -1.142094
2017-03-07   -0.033565
2017-03-08   -0.055100
Freq: D, Name: A, dtype: float64

0.523692755138

                   A         B         C
2017-03-01  0.523693  0.949603 -0.683277
2017-03-02 -1.327872 -0.240553 -0.955248
2017-03-03 -0.426860 -1.569299 -0.776820
2017-03-04  1.924556  0.420573 -0.517472

 

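As noted above, loc accepts only index labels, not positions (0, 1, 2, 3, ...).

iloc is the counterpart that selects by integer position, and only by position. It works like loc with every label replaced by a position, except that iloc slices exclude the end position, as Python list slices do.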

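print(df.iloc[0])          # first row, as a Series
print()
print(df.iloc[:, 0])       # first column, as a Series
print()
print(df.iloc[0, 0])       # one element
print()
print(df.iloc[0:3, 0:3])   # first 3 rows and first 3 columns

Output: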
A   -0.746018
B   -2.008161
C    0.662723
D   -1.446216
E   -1.069992
Name: 2017-03-01 00:00:00, dtype: float64

2017-03-01   -0.746018
2017-03-02    0.355898
2017-03-03    0.224572
2017-03-04    0.491077
2017-03-05    0.189671
2017-03-06    1.287336
2017-03-07    0.625124
2017-03-08   -1.064447
Freq: D, Name: A, dtype: float64

-0.746017502389

                   A         B         C
2017-03-01 -0.746018 -2.008161  0.662723
2017-03-02  0.355898 -0.773666  0.741954
2017-03-03  0.224572  1.839602 -1.701422
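
The difference in how loc and iloc treat slice end points is easy to miss with random data. The sketch below uses made-up integers (0 to 39) so the two selections can be compared directly.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170301', periods=8)
# Made-up data: row i holds the values 5*i .. 5*i + 4.
df = pd.DataFrame(np.arange(40).reshape(8, 5), index=dates, columns=list('ABCDE'))

by_label = df.loc['2017-03-01':'2017-03-03', 'A':'C']   # end points included
by_position = df.iloc[0:3, 0:3]                          # end points excluded

print(by_label.shape)
print(by_position.shape)
print(by_label.equals(by_position))             # both select the same block
print(df.loc[dates[1], 'B'], df.iloc[1, 1])     # same cell, addressed two ways
```

The label slice ending at '2017-03-03' and the position slice ending at 3 select the same three rows, because only the label slice includes its end point.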

 

Filtering the rows that meet a condition gives a new DataFrame:

df[(df.A > 0) & (df['B'] > 0)]

df.loc[(df.A > 0) | (df['B'] < 0)]

print(df)
print(df[(df.A > 0) & (df['B'] > 0)])
print(df.loc[(df.A > 0) | (df['B'] < 0)])

Output:
                   A         B         C         D         E
2017-03-01  0.399499 -0.301952  0.829142  0.378531 -0.372409
2017-03-02  1.856642 -0.569681 -0.639396  0.352889 -0.579640
2017-03-03 -0.688705 -1.020069  0.694585  0.954841  0.108886
2017-03-04 -0.251342  0.963177 -1.245065 -0.405680 -0.264811
2017-03-05 -0.421710 -0.404864  0.295869 -1.315680  1.849906
2017-03-06  1.036118 -1.373403 -0.297122 -0.795075 -0.245171
2017-03-07  0.601060  1.765738  0.948425 -0.574575  1.008444
2017-03-08 -0.587488 -0.696066 -1.634978 -0.416340  0.791085
                  A         B         C         D         E
2017-03-07  0.60106  1.765738  0.948425 -0.574575  1.008444
                   A         B         C         D         E
2017-03-01  0.399499 -0.301952  0.829142  0.378531 -0.372409
2017-03-02  1.856642 -0.569681 -0.639396  0.352889 -0.579640
2017-03-03 -0.688705 -1.020069  0.694585  0.954841  0.108886
2017-03-05 -0.421710 -0.404864  0.295869 -1.315680  1.849906
2017-03-06  1.036118 -1.373403 -0.297122 -0.795075 -0.245171
2017-03-07  0.601060  1.765738  0.948425 -0.574575  1.008444
2017-03-08 -0.587488 -0.696066 -1.634978 -0.416340  0.791085

When there are several conditions, wrap each one in parentheses and combine them with & (and) and | (or).
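
The parentheses are needed because & and | bind more tightly than comparison operators in Python, so an unparenthesized condition is not parsed the way it reads. A small sketch on made-up data, which also shows ~ for negation (not used in the text above):

```python
import pandas as pd

# Made-up data covering all four sign combinations of A and B.
df = pd.DataFrame({'A': [1, -1, 2, -2], 'B': [1, 1, -1, -1]})

both = df[(df.A > 0) & (df['B'] > 0)]          # A > 0 AND B > 0
either = df[(df.A > 0) | (df['B'] < 0)]        # A > 0 OR  B < 0
neither = df[~((df.A > 0) | (df['B'] < 0))]    # NOT (A > 0 OR B < 0)

print(both.index.tolist())      # [0]
print(either.index.tolist())    # [0, 2, 3]
print(neither.index.tolist())   # [1]
```

Python's own and / or keywords cannot be used here, since they expect a single True or False rather than a Series of them.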

 

df[df > 0] keeps the elements of df that are greater than 0 and replaces all other elements with np.nan, returning the result as a new DataFrame.

print(df)
print(df[df > 0])

                   A         B         C         D         E
2017-03-01  1.677450  2.163308  1.062092 -0.523620  0.628484
2017-03-02 -0.246469  1.167712  0.422173 -1.267306  0.452185
2017-03-03 -0.016746 -1.110537 -2.106998 -0.715175 -1.450872
2017-03-04  0.900309  1.416489  1.389152  0.416001  1.557737
2017-03-05  0.577419  0.525642 -2.726353 -0.506887 -0.765607
2017-03-06 -0.598997  2.052256  0.204728  1.783496 -1.765711
2017-03-07 -1.267873  0.856503  1.236517 -1.239220  0.536613
2017-03-08 -2.534660 -1.395564 -0.542685  0.800363 -1.008428
                   A         B         C         D         E
2017-03-01  1.677450  2.163308  1.062092       NaN  0.628484
2017-03-02       NaN  1.167712  0.422173       NaN  0.452185
2017-03-03       NaN       NaN       NaN       NaN       NaN
2017-03-04  0.900309  1.416489  1.389152  0.416001  1.557737
2017-03-05  0.577419  0.525642       NaN       NaN       NaN
2017-03-06       NaN  2.052256  0.204728  1.783496       NaN
2017-03-07       NaN  0.856503  1.236517       NaN  0.536613
2017-03-08       NaN       NaN       NaN  0.800363       NaN

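df[df['E'].isin(values)] keeps the rows whose value in column E appears in the list values.

df.loc[df.index[:3], 'E'] = 2    # first three rows (loc needs labels, so take them from df.index)
df.loc[df.index[3:], 'E'] = 3    # remaining rows
print(df)
print(df[df['E'].isin([1, 2])])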


                   A         B         C         D    E
2017-03-01  0.831030  0.091797 -1.372896 -0.209519  2.0
2017-03-02 -0.207082  1.756175  0.814452  0.919294  2.0
2017-03-03 -0.309872  0.823114 -0.667895 -0.723452  2.0
2017-03-04 -0.232162 -0.387264 -0.366248  0.908574  3.0
2017-03-05  0.382886 -1.131076 -0.369336 -0.128234  3.0
2017-03-06  0.665425 -0.240306  0.167547  0.215651  3.0
2017-03-07  0.709806  1.931120 -1.107219  0.331201  3.0
2017-03-08  0.527246  0.683884  0.084874  1.195304  3.0
                   A         B         C         D    E
2017-03-01  0.831030  0.091797 -1.372896 -0.209519  2.0
2017-03-02 -0.207082  1.756175  0.814452  0.919294  2.0
2017-03-03 -0.309872  0.823114 -0.667895 -0.723452  2.0
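
isin returns a boolean Series, which can be stored, inspected, and inverted like any other condition. A minimal sketch on made-up data (the column X and the name mask are illustrative only):

```python
import pandas as pd

# Made-up frame; column E mixes the values 1, 2 and 3.
df = pd.DataFrame({'E': [2.0, 2.0, 3.0, 1.0], 'X': ['a', 'b', 'c', 'd']})

mask = df['E'].isin([1, 2])     # True where E is 1 or 2
selected = df[mask]             # rows that match
rest = df[~mask]                # rows that do not

print(mask.tolist())            # [True, True, False, True]
print(selected['X'].tolist())   # ['a', 'b', 'd']
print(rest['X'].tolist())       # ['c']
```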

 

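IV. Setting values in a DataFrame

Select a row, a column, or a single element with any of the selection methods above, then assign to it to modify the data in place.

The steps are essentially the same as for selection.

DataFrames also support the four arithmetic operations (+, -, *, /). They are applied element by element at matching row and column labels and return a new DataFrame.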
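
Since this section has no example of its own, here is a minimal sketch of assignment and arithmetic on a small made-up frame of ones. The variable names total and scaled are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170301', periods=3)
df = pd.DataFrame(np.ones((3, 2)), index=dates, columns=['A', 'B'])

df.loc[dates[0], 'A'] = 10.0    # one element, selected by label
df.iloc[1, 1] = 20.0            # one element, selected by position
df['B'] = df['B'] * 2           # a whole column
df.loc[dates[2]] = 0.0          # a whole row

# Arithmetic works element by element and returns new DataFrames;
# df itself is left unchanged.
total = df + df
scaled = df * 0.5

print(df)
print(total)
print(scaled)
```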

 

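V. Handling missing values in a DataFrame

Missing values can be dropped:

df.dropna(axis=0/1, how='any'/'all') drops rows (axis=0, the default) or columns (axis=1). With how='any' (the default), a row or column is dropped if it contains at least one missing value; with how='all', it is dropped only if all of its values are missing.

Or all missing values can be filled:

df.fillna(value=value)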

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
# Three 1s followed by NaN, on the same dates as df
s = pd.Series([1] * 3 + [np.nan] * (len(df) - 3), index=dates)

df['E'] = s                      # replace column E
df['F'] = np.nan                 # add an all-NaN column F
df.loc[df.index[-1]] = np.nan    # blank out the last row
print(df)

print(df.dropna(axis=1, how='all'))
print(df.dropna(axis=1, how='all').dropna(axis=0, how='any'))
print(df.dropna(how='all'))

Output:
                   A         B         C         D    E   F
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0 NaN
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0 NaN
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0 NaN
2017-03-04  1.583241 -0.277278 -0.303702  1.724784  NaN NaN
2017-03-05 -1.030510  0.311998 -2.508356 -0.824971  NaN NaN
2017-03-06 -0.322945 -0.215030  0.356070 -1.027667  NaN NaN
2017-03-07  0.315569 -0.780942  0.951732  0.018470  NaN NaN
2017-03-08       NaN       NaN       NaN       NaN  NaN NaN
                   A         B         C         D    E
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0
2017-03-04  1.583241 -0.277278 -0.303702  1.724784  NaN
2017-03-05 -1.030510  0.311998 -2.508356 -0.824971  NaN
2017-03-06 -0.322945 -0.215030  0.356070 -1.027667  NaN
2017-03-07  0.315569 -0.780942  0.951732  0.018470  NaN
2017-03-08       NaN       NaN       NaN       NaN  NaN
                   A         B         C         D    E
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0
                   A         B         C         D    E   F
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0 NaN
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0 NaN
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0 NaN
2017-03-04  1.583241 -0.277278 -0.303702  1.724784  NaN NaN
2017-03-05 -1.030510  0.311998 -2.508356 -0.824971  NaN NaN
2017-03-06 -0.322945 -0.215030  0.356070 -1.027667  NaN NaN
2017-03-07  0.315569 -0.780942  0.951732  0.018470  NaN NaN
Filling every missing value with 2:

print(df)
print(df.fillna(value=2))


                   A         B         C         D    E   F
2017-03-01  0.529989  1.278479 -2.450377  1.019220  1.0 NaN
2017-03-02 -0.834147  0.563709  2.127497 -0.004560  1.0 NaN
2017-03-03 -1.630047 -0.251976 -0.217972  1.530107  1.0 NaN
2017-03-04  1.012212 -0.197851  2.217734  0.290256  NaN NaN
2017-03-05  1.259308  0.102747  0.183875 -0.048879  NaN NaN
2017-03-06  0.199627  1.776640  1.347103 -1.655109  NaN NaN
2017-03-07 -0.144254  0.533370  0.692462  0.690940  NaN NaN
2017-03-08       NaN       NaN       NaN       NaN  NaN NaN
                   A         B         C         D    E    F
2017-03-01  0.529989  1.278479 -2.450377  1.019220  1.0  2.0
2017-03-02 -0.834147  0.563709  2.127497 -0.004560  1.0  2.0
2017-03-03 -1.630047 -0.251976 -0.217972  1.530107  1.0  2.0
2017-03-04  1.012212 -0.197851  2.217734  0.290256  2.0  2.0
2017-03-05  1.259308  0.102747  0.183875 -0.048879  2.0  2.0
2017-03-06  0.199627  1.776640  1.347103 -1.655109  2.0  2.0
2017-03-07 -0.144254  0.533370  0.692462  0.690940  2.0  2.0
2017-03-08  2.000000  2.000000  2.000000  2.000000  2.0  2.0
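
The same steps can be traced on a tiny made-up frame where the missing cells are easy to count. The dictionary form of fillna (one fill value per column) goes slightly beyond the text above and is shown only as an option.

```python
import numpy as np
import pandas as pd

# Made-up frame: A has 1 missing value, B has 2, F is entirely missing.
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan],
    'B': [1.0, np.nan, np.nan],
    'F': [np.nan, np.nan, np.nan],
})

missing_per_column = df.isna().sum().to_dict()

no_empty_cols = df.dropna(axis=1, how='all')              # drops only F
complete_rows = no_empty_cols.dropna(axis=0, how='any')   # keeps only row 0
filled = df.fillna(value=0)                               # every NaN becomes 0
per_column = df.fillna(value={'A': 0, 'B': -1})           # F is left as NaN

print(missing_per_column)
print(list(no_empty_cols.columns))
print(complete_rows.index.tolist())
print(per_column)
```

Both dropna and fillna return new DataFrames here; df itself still contains its missing values afterwards.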

 

To summarize:

1. Creation of Series and DataFrame objects.

  Series: pd.Series(values, index)

  DataFrame: pd.DataFrame(values, index=list, columns=list)

2. Built-in DataFrame methods and attributes.

  df.head(), df.tail(), df.sort_values(), df.mean(), df.describe(), as well as df.values, df.index, df.T, and so on.

3. Selecting data from a DataFrame, which is the main thing to master.

  df['A'] selects a single column directly; apart from that, df[ ] takes only slices, which may be label slices or position slices.

  df.loc[] takes labels; df.iloc[] takes positions.

  df[df > 0] keeps the elements that satisfy a condition. To filter rows on several conditions, use df[(df.A > 0) & (df.B < 0)].

4. Missing values: drop or fill.   df.dropna(axis=0, how=...)      df.fillna(value)

Posted 2017-06-29 10:14 by 山羊0.0