A Brief Introduction to pandas for Data Processing
Official website: http://pandas.pydata.org/
I. The two basic data structures: Series and DataFrame
Let's look at Series first.
import pandas as pd
import numpy as np

# Series
s = pd.Series([i * 2 for i in range(1, 11)])
print(s)
Output:
0 2
1 4
2 6
3 8
4 10
5 12
6 14
7 16
8 18
9 20
dtype: int64
Here, the 0 to 9 on the left are the index values, and the 2, 4, 6, ... on the right are the values from the list we passed in.
Now let's look at DataFrame.
import pandas as pd
import numpy as np

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
print(df)
Output:
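The index does not have to be 0, 1, 2, ...: it can be passed explicitly. The following is a minimal sketch of that idea; the labels 'a', 'b', 'c' are made up for illustration and do not come from the example above.

```python
import pandas as pd

# Values plus an explicit index; the labels are hypothetical.
s_labeled = pd.Series([2, 4, 6], index=['a', 'b', 'c'])

print(s_labeled.index.tolist())   # the index labels
print(s_labeled.values.tolist())  # the underlying values
print(s_labeled['b'])             # look up one value by its label
```

Looking up 'b' returns the value stored under that label, which is the same index-to-value pairing shown in the printed Series above.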
A B C D E
2017-03-01 1.446957 -0.969023 -0.272529 0.695884 -0.842616
2017-03-02 -0.193140 1.231356 0.761668 -0.859277 -1.002324
2017-03-03 -0.441364 1.059026 0.392266 1.180888 0.144625
2017-03-04 0.510129 0.851746 0.110843 0.745591 -0.724988
2017-03-05 0.417613 -0.640111 -1.048320 1.605048 0.935129
2017-03-06 0.805600 0.491515 0.042078 0.081229 -0.293101
2017-03-07 -1.597687 0.268910 1.078853 -1.488760 -1.881305
2017-03-08 -2.414063 1.147526 0.143332 0.622884 1.760944
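Here, the first argument, np.random.randn(8, 5), returns an array with 8 rows and 5 columns whose elements are random numbers drawn from the standard normal distribution.
The second argument, index=dates (dates is an array-like of dates), supplies the DataFrame's row index.
The third argument, columns=list('ABCDE'), supplies the label of each column of the DataFrame.
A DataFrame can also be built from a dict. Each key becomes a column label and each value supplies that column's elements. The number of rows comes from the list-valued entries, which must all have the same length; a scalar value such as 1 is repeated for every row.
df = pd.DataFrame({'A': 1, 'B': [1, 2, 3, 4]})
print(df)
   A  B
0  1  1
1  1  2
2  1  3
3  1  4
As you can see, each column of a DataFrame is a Series, so a DataFrame can be thought of as a collection of Series that share the same index.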
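The relationship between the two structures can be checked directly. This is a small sketch (the name df_small is made up for illustration) showing that a selected column is a Series sharing the frame's index, and that list values of different lengths are rejected rather than padded.

```python
import pandas as pd

df_small = pd.DataFrame({'A': 1, 'B': [1, 2, 3, 4]})

col = df_small['B']
print(type(col).__name__)                # the column is a Series
print(col.index.equals(df_small.index))  # it shares the DataFrame's index
print(df_small['A'].tolist())            # the scalar 1 was repeated for each row

# Lists of different lengths are not padded; pandas raises an error instead.
try:
    pd.DataFrame({'A': [1, 2], 'B': [1, 2, 3, 4]})
except ValueError as exc:
    print('ValueError:', exc)
```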
II. Basic DataFrame operations
df.head(n=5) returns the first n rows of df. n defaults to 5.
df.tail(n=5) returns the last n rows of df. n defaults to 5.
df.index returns the index of df.
import pandas as pd
import numpy as np

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

# Basic
print('Head')
print(df.head(3))
print('Tail')
print(df.tail(3))
print()
print(df.index)
Head
                   A         B         C         D         E
2017-03-01  0.872154  0.887637  0.877745  0.170153 -0.595866
2017-03-02 -2.260319 -1.400152 -0.347347 -0.880254 -0.388510
2017-03-03 -0.032758  0.393881 -0.279599  1.904316 -1.292630
Tail
                   A         B         C         D         E
2017-03-06 -0.116548 -0.459674  0.671389 -0.536236  1.224103
2017-03-07 -0.067690  0.678551 -0.258071 -0.352931  0.415018
2017-03-08  0.006201  0.464584  0.141018 -0.076282 -0.638886

DatetimeIndex(['2017-03-01', '2017-03-02', '2017-03-03', '2017-03-04',
               '2017-03-05', '2017-03-06', '2017-03-07', '2017-03-08'],
              dtype='datetime64[ns]', freq='D')
df.values returns the element values of df. The returned object is a numpy.ndarray.
df.T returns a transposed df (rows and columns swapped).
import pandas as pd
import numpy as np

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

print(df.values)
print(df.T)
[[ 0.08981458  2.35966602  0.00606022  0.08633954  1.05939747]
 [-1.05151225 -1.19768201  1.83672123  1.20769635 -0.30581458]
 [-0.17192213 -0.75261065  1.04369857 -0.14874237  2.07925093]
 [-0.94600881  0.68897204 -0.18006348 -1.39294212 -0.24695665]
 [ 0.7730522  -1.62446734 -1.35308009  2.97657871  0.56537233]
 [ 0.24186251  0.56652445 -0.00513021  0.14593751  0.07460181]
 [-1.52712564  0.79666412 -1.68573768  0.85084609  0.48469802]
 [ 1.49180784 -0.04688902 -0.89278834 -0.81667428 -0.15639693]]
   2017-03-01  2017-03-02  2017-03-03  2017-03-04  2017-03-05  2017-03-06  \
A    0.089815   -1.051512   -0.171922   -0.946009    0.773052    0.241863
B    2.359666   -1.197682   -0.752611    0.688972   -1.624467    0.566524
C    0.006060    1.836721    1.043699   -0.180063   -1.353080   -0.005130
D    0.086340    1.207696   -0.148742   -1.392942    2.976579    0.145938
E    1.059397   -0.305815    2.079251   -0.246957    0.565372    0.074602

   2017-03-07  2017-03-08
A   -1.527126    1.491808
B    0.796664   -0.046889
C   -1.685738   -0.892788
D    0.850846   -0.816674
E    0.484698   -0.156397
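With random data it is hard to see exactly what moved where, so here is a small deterministic sketch of the same two attributes. The frame df_small and its labels 'r0', 'r1' are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2x3 frame holding 0..5 so every element is easy to track.
df_small = pd.DataFrame(np.arange(6).reshape(2, 3),
                        index=['r0', 'r1'], columns=list('ABC'))

arr = df_small.values
print(type(arr).__name__, arr.shape)  # a plain NumPy array with 2 rows, 3 columns

t = df_small.T
print(t.shape)                        # rows and columns are swapped
print(t.index.tolist(), t.columns.tolist())
print(df_small.loc['r1', 'B'], t.loc['B', 'r1'])  # same element, swapped labels
```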
df.sort_values(['column_name'], ascending=0/1) sorts the rows by the values in the given column(s). ascending=1 (True, the default) sorts in ascending order and ascending=0 (False) in descending order.
df.sort_index(axis=0/1, ascending=0/1) sorts by labels rather than values: by the row index (axis=0) or by the column labels (axis=1), in descending (ascending=0) or ascending (ascending=1) order.
import pandas as pd
import numpy as np

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

print(df.sort_values('A', ascending=True))    # rows ordered by column A
print(df.sort_index(axis=0, ascending=False))  # rows in reverse date order
print(df.sort_index(axis=1, ascending=False))  # columns in reverse label order
                   A         B         C         D         E
2017-03-04 -0.885580  0.668956  0.007392  0.561356 -0.214626
2017-03-05 -0.135071 -1.049060 -1.305366  0.558175 -0.087092
2017-03-08  0.188582  0.492789 -1.364214  0.504932  1.241542
2017-03-02  0.306423 -1.595937  0.532442  0.773825  0.196982
2017-03-06  0.901500 -0.115927 -1.448039  1.733633 -0.805994
2017-03-01  0.951188 -2.335634  1.592160  0.166211 -0.716212
2017-03-03  1.654593  0.431696  0.084542  0.121351 -0.197380
2017-03-07  1.673910  0.799920 -0.010755 -0.959697 -0.498297
                   A         B         C         D         E
2017-03-08  0.188582  0.492789 -1.364214  0.504932  1.241542
2017-03-07  1.673910  0.799920 -0.010755 -0.959697 -0.498297
2017-03-06  0.901500 -0.115927 -1.448039  1.733633 -0.805994
2017-03-05 -0.135071 -1.049060 -1.305366  0.558175 -0.087092
2017-03-04 -0.885580  0.668956  0.007392  0.561356 -0.214626
2017-03-03  1.654593  0.431696  0.084542  0.121351 -0.197380
2017-03-02  0.306423 -1.595937  0.532442  0.773825  0.196982
2017-03-01  0.951188 -2.335634  1.592160  0.166211 -0.716212
                   E         D         C         B         A
2017-03-01 -0.716212  0.166211  1.592160 -2.335634  0.951188
2017-03-02  0.196982  0.773825  0.532442 -1.595937  0.306423
2017-03-03 -0.197380  0.121351  0.084542  0.431696  1.654593
2017-03-04 -0.214626  0.561356  0.007392  0.668956 -0.885580
2017-03-05 -0.087092  0.558175 -1.305366 -1.049060 -0.135071
2017-03-06 -0.805994  1.733633 -1.448039 -0.115927  0.901500
2017-03-07 -0.498297 -0.959697 -0.010755  0.799920  1.673910
2017-03-08  1.241542  0.504932 -1.364214  0.492789  0.188582
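The difference between sorting by values and sorting by labels is easier to see on a tiny frame. The sketch below uses made-up data (df_small with labels 'x', 'y', 'z'); it also shows that both methods return a new object and leave the original order untouched.

```python
import pandas as pd

df_small = pd.DataFrame({'B': [3, 1, 2], 'A': [30, 10, 20]},
                        index=['y', 'x', 'z'])

by_value = df_small.sort_values('B', ascending=False)  # order rows by column B
by_row_label = df_small.sort_index(axis=0)             # order rows by index label
by_col_label = df_small.sort_index(axis=1)             # order columns by label

print(by_value.index.tolist())        # ['y', 'z', 'x']
print(by_row_label.index.tolist())    # ['x', 'y', 'z']
print(by_col_label.columns.tolist())  # ['A', 'B']
print(df_small.index.tolist())        # ['y', 'x', 'z'], the original is unchanged
```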
df.describe() gives a statistical summary of the data in the DataFrame.
print(df.describe())
              A         B         C         D         E
count  8.000000  8.000000  8.000000  8.000000  8.000000
mean  -0.087813  0.718481 -0.267764 -0.451818  0.560079
std    1.494381  0.499839  1.070840  1.008569  1.263091
min   -2.176000  0.045393 -1.701633 -1.838860 -2.090028
25%   -1.405379  0.298836 -0.848086 -0.997443  0.119207
50%    0.107146  0.836200 -0.246589 -0.536753  0.866753
75%    1.132235  1.040045  0.179091 -0.217005  1.542020
max    1.928376  1.310759  1.741579  1.624236  1.691431
For each column it reports the count of non-missing values, the mean, the standard deviation (std), the minimum, the lower quartile (25%), the median (50%), the upper quartile (75%), and the maximum.
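To make the rows of the summary concrete, the sketch below runs describe() on four known values and compares two of its entries with direct computations. The data and the name col_df are made up for illustration; the comparison assumes the std row is the sample standard deviation (ddof=1), which is pandas' default.

```python
import numpy as np
import pandas as pd

values = [1.0, 2.0, 3.0, 4.0]
col_df = pd.DataFrame({'A': values})

summary = col_df.describe()
print(summary)

# 'std' matches NumPy's sample standard deviation (ddof=1).
print(np.isclose(summary.loc['std', 'A'], np.std(values, ddof=1)))
# '50%' is the median.
print(summary.loc['50%', 'A'] == col_df['A'].median())
```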
df.mean() computes the mean of each column of the DataFrame (the elements must be numeric). In the example below column A is overwritten with strings, so it is left out of the result; older pandas versions skipped such columns automatically, while recent versions need numeric_only=True.
The resulting means can be converted with list(). The elements of that list are still numbers (numpy.float64).
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
df['A'] = 'a'  # make column A non-numeric
print(df)
print(df.mean(numeric_only=True))
print(list(df.mean(numeric_only=True)))
print(type(list(df.mean(numeric_only=True))[0]))
            A         B         C         D         E
2017-03-01  a -0.857906  0.380529  0.531562  0.299163
2017-03-02  a  0.391248 -2.227574  0.792068 -1.100136
2017-03-03  a  0.260002  0.294271  0.392461  0.161064
2017-03-04  a -0.136737  0.018517 -0.284478  0.009943
2017-03-05  a -0.725036 -0.031868  1.289505 -0.108265
2017-03-06  a  1.616869 -1.528318  0.311700  1.386990
2017-03-07  a -0.961123 -0.244735 -0.120312 -0.595079
2017-03-08  a -0.631889  0.205291 -0.407998 -0.388415
B   -0.130572
C   -0.391736
D    0.313063
E   -0.041842
dtype: float64
[-0.13057154848170802, -0.39173579343646675, 0.31306349321693849, -0.041841901046697855]
<class 'numpy.float64'>
III. Selecting from a DataFrame (slicing)
print(df['A'])
print(type(df['A']))
2017-03-01   -0.158978
2017-03-02   -1.690027
2017-03-03    1.188897
2017-03-04   -0.913982
2017-03-05    0.433453
2017-03-06   -1.381605
2017-03-07    0.148752
2017-03-08    1.021067
Freq: D, Name: A, dtype: float64
<class 'pandas.core.series.Series'>
Indexing df directly with a column name returns a Series containing the index and that column's values.
When we need a single row (or a DataFrame of several rows), we cannot select it directly with the row's index label or position.
print(df['2017-03-01'])
print(df[0])
Both of these raise an error (a KeyError), because df[...] with a single key looks for a column with that name.
Instead, slice the rows the way you would slice a list. The slice can use row positions (0, 1, 2, 3) or index labels (2017-03-01, 2017-03-02, ...). As the output below shows, a label slice includes its end label, while a positional slice excludes its end position.
print(df[0:1])
print(df[1:])
print(df['2017-03-01':'2017-03-03'])
print(df[:])
                   A         B         C         D        E
2017-03-01 -0.744223  0.517575  0.199179 -0.531218  1.18652
                   A         B         C         D         E
2017-03-02  0.297713 -1.394280  0.722143  0.194107  0.020040
2017-03-03  1.040041  0.844153 -1.523378 -0.024551  2.524847
2017-03-04 -0.136714  0.581337  0.458747 -1.616134 -0.831049
2017-03-05  1.131013  1.268097  0.392704 -0.891760  0.056044
2017-03-06 -0.479798 -0.408351 -1.041832  0.052908 -1.037984
2017-03-07  0.886389  1.528950  1.044967  1.646536 -0.394471
2017-03-08 -0.712788  0.571170 -0.916402  0.843917  1.471186
                   A         B         C         D         E
2017-03-01 -0.744223  0.517575  0.199179 -0.531218  1.186520
2017-03-02  0.297713 -1.394280  0.722143  0.194107  0.020040
2017-03-03  1.040041  0.844153 -1.523378 -0.024551  2.524847
                   A         B         C         D         E
2017-03-01 -0.744223  0.517575  0.199179 -0.531218  1.186520
2017-03-02  0.297713 -1.394280  0.722143  0.194107  0.020040
2017-03-03  1.040041  0.844153 -1.523378 -0.024551  2.524847
2017-03-04 -0.136714  0.581337  0.458747 -1.616134 -0.831049
2017-03-05  1.131013  1.268097  0.392704 -0.891760  0.056044
2017-03-06 -0.479798 -0.408351 -1.041832  0.052908 -1.037984
2017-03-07  0.886389  1.528950  1.044967  1.646536 -0.394471
2017-03-08 -0.712788  0.571170 -0.916402  0.843917  1.471186
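The two kinds of row slice can be compared directly. The sketch below is a minimal illustration on made-up deterministic data (df_small): a positional slice 0:3 and a label slice covering the first three dates select the same rows.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170301', periods=8)
df_small = pd.DataFrame(np.arange(16).reshape(8, 2), index=dates, columns=list('AB'))

by_position = df_small[0:3]                     # positions 0, 1, 2 (end excluded)
by_label = df_small['2017-03-01':'2017-03-03']  # both end labels included

print(len(by_position), len(by_label))
print(by_position.equals(by_label))
```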
We can also use loc to select individual elements, DataFrame objects, or Series objects.
To select a single row (say row 0, whose index label is 2017-03-01), use df.loc['2017-03-01'] or df.loc[df.index[0]]; either returns a Series. df.loc[0] does not work.
To select a single column (say the one labeled 'A'), use df.loc[:, 'A'], which returns a Series.
In other words, loc selects by index label only; the positions 0, 1, 2, ... cannot stand in for the labels here.
To get a DataFrame made of several rows and columns, use slices.
Unlike the df[...] slices above, loc slices must use index labels.
import pandas as pd
import numpy as np

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))

print(df.loc['2017-03-01'])       # one row, as a Series
print()
print(df.loc[:, 'A'])             # one column, as a Series
print()
print(df.loc['2017-03-01', 'A'])  # a single element
print()
print(df.loc[df.index[0]:df.index[3], 'A':'C'])  # a block, as a DataFrame
A    0.523693
B    0.949603
C   -0.683277
D    0.570584
E   -0.762546
Name: 2017-03-01 00:00:00, dtype: float64

2017-03-01    0.523693
2017-03-02   -1.327872
2017-03-03   -0.426860
2017-03-04    1.924556
2017-03-05   -0.107997
2017-03-06   -1.142094
2017-03-07   -0.033565
2017-03-08   -0.055100
Freq: D, Name: A, dtype: float64

0.523692755138

                   A         B         C
2017-03-01  0.523693  0.949603 -0.683277
2017-03-02 -1.327872 -0.240553 -0.955248
2017-03-03 -0.426860 -1.569299 -0.776820
2017-03-04  1.924556  0.420573 -0.517472
As noted above, loc selects by index label only, not by position (0, 1, 2, 3, ...).
The iloc indexer selects by position, and only by position. It otherwise works like loc, with every label replaced by the corresponding position. Note that iloc slices exclude the end position, so 0:3 selects three rows.
print(df.iloc[0])
print()
print(df.iloc[:, 0])
print()
print(df.iloc[0, 0])
print()
print(df.iloc[0:3, 0:3])
A   -0.746018
B   -2.008161
C    0.662723
D   -1.446216
E   -1.069992
Name: 2017-03-01 00:00:00, dtype: float64

2017-03-01   -0.746018
2017-03-02    0.355898
2017-03-03    0.224572
2017-03-04    0.491077
2017-03-05    0.189671
2017-03-06    1.287336
2017-03-07    0.625124
2017-03-08   -1.064447
Freq: D, Name: A, dtype: float64

-0.746017502389

                   A         B         C
2017-03-01 -0.746018 -2.008161  0.662723
2017-03-02  0.355898 -0.773666  0.741954
2017-03-03  0.224572  1.839602 -1.701422
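Since loc and iloc address the same data in two ways, a side-by-side check can help. The sketch below uses made-up deterministic data (df_small) and selects the same block and the same element once by label and once by position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170301', periods=4)
df_small = pd.DataFrame(np.arange(12).reshape(4, 3), index=dates, columns=list('ABC'))

by_label = df_small.loc['2017-03-01':'2017-03-03', 'A':'B']  # end labels included
by_position = df_small.iloc[0:3, 0:2]                        # end positions excluded
print(by_label.equals(by_position))

# The same single element, addressed by label and by position.
print(df_small.loc['2017-03-02', 'C'], df_small.iloc[1, 2])
```

The label slice 'A':'B' and the positional slice 0:2 both cover two columns, because only the positional form drops its end point.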
We can also filter for rows that meet certain conditions, which gives a new DataFrame:
df[(df.A > 0) & (df['B'] > 0)]
df.loc[(df.A > 0) | (df['B'] < 0)]
print(df)
print(df[(df.A > 0) & (df['B'] > 0)])
print(df.loc[(df.A > 0) | (df['B'] < 0)])
                   A         B         C         D         E
2017-03-01  0.399499 -0.301952  0.829142  0.378531 -0.372409
2017-03-02  1.856642 -0.569681 -0.639396  0.352889 -0.579640
2017-03-03 -0.688705 -1.020069  0.694585  0.954841  0.108886
2017-03-04 -0.251342  0.963177 -1.245065 -0.405680 -0.264811
2017-03-05 -0.421710 -0.404864  0.295869 -1.315680  1.849906
2017-03-06  1.036118 -1.373403 -0.297122 -0.795075 -0.245171
2017-03-07  0.601060  1.765738  0.948425 -0.574575  1.008444
2017-03-08 -0.587488 -0.696066 -1.634978 -0.416340  0.791085
                  A         B         C         D         E
2017-03-07  0.60106  1.765738  0.948425 -0.574575  1.008444
                   A         B         C         D         E
2017-03-01  0.399499 -0.301952  0.829142  0.378531 -0.372409
2017-03-02  1.856642 -0.569681 -0.639396  0.352889 -0.579640
2017-03-03 -0.688705 -1.020069  0.694585  0.954841  0.108886
2017-03-05 -0.421710 -0.404864  0.295869 -1.315680  1.849906
2017-03-06  1.036118 -1.373403 -0.297122 -0.795075 -0.245171
2017-03-07  0.601060  1.765738  0.948425 -0.574575  1.008444
2017-03-08 -0.587488 -0.696066 -1.634978 -0.416340  0.791085
When there are several conditions, each one must be wrapped in () and the conditions joined with & (and) or | (or). The parentheses are required because & and | bind more tightly than comparisons such as > in Python.
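The rule about combining conditions can be sketched on a four-row frame where the expected rows are easy to work out by hand. The data and the name df_small are made up; the ~ operator (element-wise "not") is not used in the text above and is included here as an extra.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1, -1, 2, -2], 'B': [1, 1, -1, -1]})

both = df_small[(df_small['A'] > 0) & (df_small['B'] > 0)]    # A > 0 and B > 0
either = df_small[(df_small['A'] > 0) | (df_small['B'] < 0)]  # A > 0 or B < 0
negated = df_small[~(df_small['A'] > 0)]                      # not (A > 0)

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 2, 3]
print(negated.index.tolist())  # [1, 3]
```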
df[df > 0] keeps the elements of df that are greater than 0 and fills every other position (elements that are not greater than 0) with np.nan. The result is returned as a new DataFrame.
print(df)
print(df[df > 0])
                   A         B         C         D         E
2017-03-01  1.677450  2.163308  1.062092 -0.523620  0.628484
2017-03-02 -0.246469  1.167712  0.422173 -1.267306  0.452185
2017-03-03 -0.016746 -1.110537 -2.106998 -0.715175 -1.450872
2017-03-04  0.900309  1.416489  1.389152  0.416001  1.557737
2017-03-05  0.577419  0.525642 -2.726353 -0.506887 -0.765607
2017-03-06 -0.598997  2.052256  0.204728  1.783496 -1.765711
2017-03-07 -1.267873  0.856503  1.236517 -1.239220  0.536613
2017-03-08 -2.534660 -1.395564 -0.542685  0.800363 -1.008428
                   A         B         C         D         E
2017-03-01  1.677450  2.163308  1.062092       NaN  0.628484
2017-03-02       NaN  1.167712  0.422173       NaN  0.452185
2017-03-03       NaN       NaN       NaN       NaN       NaN
2017-03-04  0.900309  1.416489  1.389152  0.416001  1.557737
2017-03-05  0.577419  0.525642       NaN       NaN       NaN
2017-03-06       NaN  2.052256  0.204728  1.783496       NaN
2017-03-07       NaN  0.856503  1.236517       NaN  0.536613
2017-03-08       NaN       NaN       NaN  0.800363       NaN
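Two details of this element-wise filter are worth checking on a tiny frame: the shape is preserved (nothing is dropped), and a value of exactly 0 is also replaced, since it is not greater than 0. The sketch below uses made-up data (df_small) and compares the result with df.where, which is assumed here to be the equivalent explicit form.

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1.5, -2.0], 'B': [0.0, 3.0]})

masked = df_small[df_small > 0]
print(masked)

print(masked.shape == df_small.shape)                 # shape is preserved
print(masked.equals(df_small.where(df_small > 0)))    # same as the explicit where()
print(int(masked.isna().sum().sum()))                 # -2.0 and 0.0 became NaN
```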
df[df['E'].isin(values)] keeps the rows whose value in column E appears in the list values. Note that isin is a method, so it is called with parentheses.
df.loc[df.index[:3], 'E'] = 2   # first three rows, selected by label
df.loc[df.index[3:], 'E'] = 3   # remaining rows
print(df)
print(df[df['E'].isin([1, 2])])
                   A         B         C         D    E
2017-03-01  0.831030  0.091797 -1.372896 -0.209519  2.0
2017-03-02 -0.207082  1.756175  0.814452  0.919294  2.0
2017-03-03 -0.309872  0.823114 -0.667895 -0.723452  2.0
2017-03-04 -0.232162 -0.387264 -0.366248  0.908574  3.0
2017-03-05  0.382886 -1.131076 -0.369336 -0.128234  3.0
2017-03-06  0.665425 -0.240306  0.167547  0.215651  3.0
2017-03-07  0.709806  1.931120 -1.107219  0.331201  3.0
2017-03-08  0.527246  0.683884  0.084874  1.195304  3.0
                   A         B         C         D    E
2017-03-01  0.831030  0.091797 -1.372896 -0.209519  2.0
2017-03-02 -0.207082  1.756175  0.814452  0.919294  2.0
2017-03-03 -0.309872  0.823114 -0.667895 -0.723452  2.0
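The membership test produces an ordinary boolean Series, so it can be inspected, reused, or inverted. A minimal sketch on made-up data (df_small, with a helper column X added only to make the selected rows recognizable):

```python
import pandas as pd

df_small = pd.DataFrame({'E': [2, 2, 3, 1], 'X': list('abcd')})

mask = df_small['E'].isin([1, 2])     # True where E is 1 or 2
print(mask.tolist())                  # [True, True, False, True]
print(df_small[mask]['X'].tolist())   # ['a', 'b', 'd']
print(df_small[~mask]['X'].tolist())  # ['c'], the rows NOT in the list
```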
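IV. Setting values in a DataFrame
Use any of the selection methods above to pick a row, a column, or a single element identified by both row and column, then assign to it directly.
The process is essentially the same as selection.
A DataFrame also supports the four arithmetic operations (+, -, *, /). They are applied to the elements at matching positions and produce a new DataFrame.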
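This section has no example of its own, so here is a minimal sketch of both ideas, assuming a small made-up frame (df_small) filled with ones: assignment through loc, iloc, and a column name, followed by element-wise arithmetic.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170301', periods=3)
df_small = pd.DataFrame(np.ones((3, 2)), index=dates, columns=list('AB'))

df_small.loc[dates[0], 'A'] = 10.0  # one element, selected by label
df_small.iloc[1, 1] = 20.0          # one element, selected by position
df_small['B'] = df_small['B'] * 2   # a whole column

# Arithmetic works element by element and returns a new DataFrame.
result = df_small + df_small * 2

print(df_small)
print(result)
```

After these steps column A holds 10, 1, 1 and column B holds 2, 40, 2; result holds three times each of those values, while df_small itself is not changed by the arithmetic.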
V. Handling missing values in a DataFrame
Missing values can be dropped:
df.dropna(axis=0/1, how='any'/'all'). axis defaults to 0 and how defaults to 'any'. It drops rows (axis=0) or columns (axis=1) that are entirely missing (how='all') or that contain at least one missing value (how='any').
Or all missing values can be filled:
df.fillna(value=value)
import pandas as pd
import numpy as np

# DataFrame
dates = pd.date_range('20170301', periods=8)
df = pd.DataFrame(np.random.randn(8, 5), index=dates, columns=list('ABCDE'))
# Three 1s followed by NaN, on the same dates as df
s = pd.Series([1] * 3 + [np.nan] * (len(df) - 3), index=dates)

df['E'] = s                    # replace column E with s
df.loc[:, 'F'] = np.nan        # add a column F that is entirely NaN
df.loc[df.index[-1]] = np.nan  # make the last row entirely NaN
print(df)

print(df.dropna(axis=1, how='all'))
print(df.dropna(axis=1, how='all').dropna(axis=0, how='any'))
print(df.dropna(how='all'))
                   A         B         C         D    E   F
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0 NaN
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0 NaN
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0 NaN
2017-03-04  1.583241 -0.277278 -0.303702  1.724784  NaN NaN
2017-03-05 -1.030510  0.311998 -2.508356 -0.824971  NaN NaN
2017-03-06 -0.322945 -0.215030  0.356070 -1.027667  NaN NaN
2017-03-07  0.315569 -0.780942  0.951732  0.018470  NaN NaN
2017-03-08       NaN       NaN       NaN       NaN  NaN NaN
                   A         B         C         D    E
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0
2017-03-04  1.583241 -0.277278 -0.303702  1.724784  NaN
2017-03-05 -1.030510  0.311998 -2.508356 -0.824971  NaN
2017-03-06 -0.322945 -0.215030  0.356070 -1.027667  NaN
2017-03-07  0.315569 -0.780942  0.951732  0.018470  NaN
2017-03-08       NaN       NaN       NaN       NaN  NaN
                   A         B         C         D    E
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0
                   A         B         C         D    E   F
2017-03-01  1.811369  1.316996 -0.641261 -0.448455  1.0 NaN
2017-03-02 -0.019703  0.749759 -0.009580  0.715036  1.0 NaN
2017-03-03  1.347926  1.026859 -1.084211 -0.813363  1.0 NaN
2017-03-04  1.583241 -0.277278 -0.303702  1.724784  NaN NaN
2017-03-05 -1.030510  0.311998 -2.508356 -0.824971  NaN NaN
2017-03-06 -0.322945 -0.215030  0.356070 -1.027667  NaN NaN
2017-03-07  0.315569 -0.780942  0.951732  0.018470  NaN NaN
print(df)
print(df.fillna(value=2))
                   A         B         C         D    E   F
2017-03-01  0.529989  1.278479 -2.450377  1.019220  1.0 NaN
2017-03-02 -0.834147  0.563709  2.127497 -0.004560  1.0 NaN
2017-03-03 -1.630047 -0.251976 -0.217972  1.530107  1.0 NaN
2017-03-04  1.012212 -0.197851  2.217734  0.290256  NaN NaN
2017-03-05  1.259308  0.102747  0.183875 -0.048879  NaN NaN
2017-03-06  0.199627  1.776640  1.347103 -1.655109  NaN NaN
2017-03-07 -0.144254  0.533370  0.692462  0.690940  NaN NaN
2017-03-08       NaN       NaN       NaN       NaN  NaN NaN
                   A         B         C         D    E    F
2017-03-01  0.529989  1.278479 -2.450377  1.019220  1.0  2.0
2017-03-02 -0.834147  0.563709  2.127497 -0.004560  1.0  2.0
2017-03-03 -1.630047 -0.251976 -0.217972  1.530107  1.0  2.0
2017-03-04  1.012212 -0.197851  2.217734  0.290256  2.0  2.0
2017-03-05  1.259308  0.102747  0.183875 -0.048879  2.0  2.0
2017-03-06  0.199627  1.776640  1.347103 -1.655109  2.0  2.0
2017-03-07 -0.144254  0.533370  0.692462  0.690940  2.0  2.0
2017-03-08  2.000000  2.000000  2.000000  2.000000  2.0  2.0
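The effect of how='any' versus how='all' can be verified by counting what is left. The sketch below uses a small made-up frame (df_small) in which column F and the last row are entirely missing, mirroring the layout of the example above.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame({
    'A': [1.0, np.nan, np.nan],
    'B': [1.0, 2.0, np.nan],
    'F': [np.nan, np.nan, np.nan],
})

any_rows = df_small.dropna(how='any')           # drop rows with at least one NaN
all_rows = df_small.dropna(how='all')           # drop rows that are entirely NaN
all_cols = df_small.dropna(axis=1, how='all')   # drop columns that are entirely NaN
filled = df_small.fillna(value=0)               # replace every NaN with 0

print(any_rows.shape)              # (0, 3): column F puts a NaN in every row
print(all_rows.shape)              # (2, 3): only the last row is all NaN
print(all_cols.columns.tolist())   # ['A', 'B']
print(int(filled.isna().sum().sum()), int(df_small.isna().sum().sum()))
```

The last line prints 0 and 6: fillna returns a new DataFrame and the original still contains its six missing values.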
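To summarize:
1. First, creation, of both Series and DataFrame.
s = pd.Series(values, index)
df = pd.DataFrame(values, index=row_labels, columns=column_labels)
2. Next, the methods and attributes built into DataFrame.
df.head(), df.tail(), df.sort_values(), df.mean(), df.describe(), as well as df.values, df.index, df.T, and so on.
3. The main thing to master is selecting data from a DataFrame.
df['A'] (or df.A) selects a column directly. For rows, df[...] only accepts slices, which may use either index labels or positions.
df.loc[] selects by index label; df.iloc[] selects by position.
df[df > 0] keeps the elements that satisfy the condition and returns them as a DataFrame. To filter rows on several conditions, use df[(df.A > 0) & (df.B < 0)].
4. Handling missing values: drop them or fill them. df.dropna(axis=0, how=...) and df.fillna(value).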