pandas: DataFrame (Part 2)
pandas: DataFrame data alignment and missing data
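DataFrame objects also align their data during arithmetic. The row index and column index of the result are the unions of the two operands' row indexes and column indexes, respectively; positions that exist in only one operand come out as NaN.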
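The alignment rule above can be sketched on two tiny frames. The frames `left` and `right` and their values are made up for illustration; `fill_value` on `DataFrame.add` is shown only as one optional way to treat a label missing from one side as 0.

```python
import pandas as pd

# Two hypothetical frames that overlap only on row 1 and column "open".
left = pd.DataFrame({"open": [1.0, 2.0], "close": [3.0, 4.0]}, index=[0, 1])
right = pd.DataFrame({"open": [10.0, 20.0], "low": [5.0, 6.0]}, index=[1, 2])

# "+" aligns on both axes: the result carries the union of the row labels
# and the union of the column labels; cells without a partner become NaN.
total = left + right
print(total)

# add() with fill_value substitutes 0 when only ONE side is missing;
# cells missing on both sides still come out as NaN.
filled = left.add(right, fill_value=0)
print(filled)
```

Only the cell at row 1, column `open` has a value in both operands, so it is the only non-NaN cell in `total`.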
Methods for handling missing data in a DataFrame
    dropna(axis=0, how='any')  # drop missing data: axis=0 drops rows, axis=1 drops columns;
                               # how='any' drops a row (column) containing any NaN,
                               # how='all' drops it only when every value in it is NaN
    fillna()    # fill in missing values
    isnull()    # True where a value is missing
    notnull()   # True where a value is present

    In [62]: df2
    Out[62]:
         open   close    high
    0  22.074  20.657  22.503
    1  20.750  20.489  20.944
    2  20.300  19.593  20.384
    3  19.426  19.977  20.308
    4  19.995  20.520  20.706
    5  20.353  20.273  20.454
    6  20.264  20.101  20.353
    7  19.999  19.739  19.999
    8  19.783  19.818  19.982
    9  19.558  19.841  19.911

    In [63]: df3
    Out[63]:
             date    open   close     low
    0  2007-03-01  22.074  20.657  20.220
    1  2007-03-02  20.750  20.489  20.256
    2  2007-03-05  20.300  19.593  19.218
    3  2007-03-06  19.426  19.977  19.315
    4  2007-03-07  19.995  20.520  19.827
    5  2007-03-08  20.353  20.273  20.167
    6  2007-03-09  20.264  20.101  19.735
    7  2007-03-12  19.999  19.739  19.646
    8  2007-03-13  19.783  19.818  19.699
    9  2007-03-14  19.558  19.841  19.333

    In [64]: df4 = df2 + df3

    In [65]: df4
    Out[65]:
        close date  high  low    open
    0  41.314  NaN   NaN  NaN  44.148
    1  40.978  NaN   NaN  NaN  41.500
    2  39.186  NaN   NaN  NaN  40.600
    3  39.954  NaN   NaN  NaN  38.852
    4  41.040  NaN   NaN  NaN  39.990
    5  40.546  NaN   NaN  NaN  40.706
    6  40.202  NaN   NaN  NaN  40.528
    7  39.478  NaN   NaN  NaN  39.998
    8  39.636  NaN   NaN  NaN  39.566
    9  39.682  NaN   NaN  NaN  39.116

    In [66]: df4.dropna(axis=1)
    Out[66]:
        close    open
    0  41.314  44.148
    1  40.978  41.500
    2  39.186  40.600
    3  39.954  38.852
    4  41.040  39.990
    5  40.546  40.706
    6  40.202  40.528
    7  39.478  39.998
    8  39.636  39.566
    9  39.682  39.116

    In [67]: df4.fillna(0)
    Out[67]:
        close date  high  low    open
    0  41.314    0   0.0  0.0  44.148
    1  40.978    0   0.0  0.0  41.500
    2  39.186    0   0.0  0.0  40.600
    3  39.954    0   0.0  0.0  38.852
    4  41.040    0   0.0  0.0  39.990
    5  40.546    0   0.0  0.0  40.706
    6  40.202    0   0.0  0.0  40.528
    7  39.478    0   0.0  0.0  39.998
    8  39.636    0   0.0  0.0  39.566
    9  39.682    0   0.0  0.0  39.116

    In [68]: df4.isnull()
    Out[68]:
       close  date  high   low   open
    0  False  True  True  True  False
    1  False  True  True  True  False
    2  False  True  True  True  False
    3  False  True  True  True  False
    4  False  True  True  True  False
    5  False  True  True  True  False
    6  False  True  True  True  False
    7  False  True  True  True  False
    8  False  True  True  True  False
    9  False  True  True  True  False

    In [69]: df4.notnull()
    Out[69]:
       close   date   high    low  open
    0   True  False  False  False  True
    1   True  False  False  False  True
    2   True  False  False  False  True
    3   True  False  False  False  True
    4   True  False  False  False  True
    5   True  False  False  False  True
    6   True  False  False  False  True
    7   True  False  False  False  True
    8   True  False  False  False  True
    9   True  False  False  False  True
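The session above only exercises `dropna(axis=1)` with the default `how='any'`. The following is a minimal, self-contained sketch of how `axis` and `how` interact; the frame `df` and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is entirely NaN, column "empty" is entirely NaN.
df = pd.DataFrame({
    "open": [1.0, np.nan, 3.0],
    "close": [4.0, np.nan, np.nan],
    "empty": [np.nan, np.nan, np.nan],
})

rows_any = df.dropna(axis=0, how="any")  # drop rows containing at least one NaN
rows_all = df.dropna(axis=0, how="all")  # drop only rows that are NaN throughout
cols_all = df.dropna(axis=1, how="all")  # drop only columns that are NaN throughout

filled = df.fillna(0)   # replace every NaN with 0
mask = df.isnull()      # boolean frame, True where a value is missing

print(rows_all)
print(cols_all)
```

Because every row has a NaN in `empty`, `how="any"` on rows leaves nothing, while `how="all"` removes only row 1. All of these calls return new objects and leave `df` unchanged.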
Common pandas methods (applicable to both Series and DataFrame)
    In [89]: df5
    Out[89]:
       id        date    open   close    high     low      volume    code
    0   0  2007-03-01  22.074  20.657  22.503  20.220  1977633.51  601318
    1   1  2007-03-02  20.750  20.489  20.944  20.256   425048.32  601318
    2   2  2007-03-05  20.300  19.593  20.384  19.218   419196.74  601318
    3   3  2007-03-06  19.426  19.977  20.308  19.315   297727.88  601318
    4   4  2007-03-07  19.995  20.520  20.706  19.827   287463.78  601318
    5   5  2007-03-08  20.353  20.273  20.454  20.167   130983.83  601318
    6   6  2007-03-09  20.264  20.101  20.353  19.735   160887.79  601318
    7   7  2007-03-12  19.999  19.739  19.999  19.646   145353.06  601318
    8   8  2007-03-13  19.783  19.818  19.982  19.699   102319.68  601318
    9   9  2007-03-14  19.558  19.841  19.911  19.333   173306.56  601318

    mean(axis=0, skipna=True)  # mean; skipna=True (the default) ignores NaN values

    In [90]: df5.mean()
    Out[90]:
    id             4.5000
    open          20.2502
    close         20.1008
    high          20.5544
    low           19.7416
    volume    411992.1150
    code      601318.0000
    dtype: float64

    In [91]: df5['open'].mean()
    Out[91]: 20.2502

    sum(axis=1)

    In [93]: df5.sum()  # sum
    Out[93]:
    id                                                       45
    date      2007-03-012007-03-022007-03-052007-03-062007-0...
    open                                                202.502
    close                                               201.008
    high                                                205.544
    low                                                 197.416
    volume                                          4.11992e+06
    code                                                6013180
    dtype: object

    sort_index(axis, ascending, ...)  # sort by row or column index
    sort_values(by, axis, ascending)  # sort by values

    In [99]: df5.sort_index(axis=0)
    Out[99]:
       id        date    open   close    high     low      volume    code
    0   0  2007-03-01  22.074  20.657  22.503  20.220  1977633.51  601318
    1   1  2007-03-02  20.750  20.489  20.944  20.256   425048.32  601318
    2   2  2007-03-05  20.300  19.593  20.384  19.218   419196.74  601318
    3   3  2007-03-06  19.426  19.977  20.308  19.315   297727.88  601318
    4   4  2007-03-07  19.995  20.520  20.706  19.827   287463.78  601318
    5   5  2007-03-08  20.353  20.273  20.454  20.167   130983.83  601318
    6   6  2007-03-09  20.264  20.101  20.353  19.735   160887.79  601318
    7   7  2007-03-12  19.999  19.739  19.999  19.646   145353.06  601318
    8   8  2007-03-13  19.783  19.818  19.982  19.699   102319.68  601318
    9   9  2007-03-14  19.558  19.841  19.911  19.333   173306.56  601318

    In [102]: df5.sort_values(['close','open'])
    Out[102]:
       id        date    open   close    high     low      volume    code
    2   2  2007-03-05  20.300  19.593  20.384  19.218   419196.74  601318
    7   7  2007-03-12  19.999  19.739  19.999  19.646   145353.06  601318
    8   8  2007-03-13  19.783  19.818  19.982  19.699   102319.68  601318
    9   9  2007-03-14  19.558  19.841  19.911  19.333   173306.56  601318
    3   3  2007-03-06  19.426  19.977  20.308  19.315   297727.88  601318
    6   6  2007-03-09  20.264  20.101  20.353  19.735   160887.79  601318
    5   5  2007-03-08  20.353  20.273  20.454  20.167   130983.83  601318
    1   1  2007-03-02  20.750  20.489  20.944  20.256   425048.32  601318
    4   4  2007-03-07  19.995  20.520  20.706  19.827   287463.78  601318
    0   0  2007-03-01  22.074  20.657  22.503  20.220  1977633.51  601318
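As a compact recap of the aggregation and sorting calls above, here is a sketch on a three-row excerpt of the same quotes. The variable name `quotes` is invented. `numeric_only=True` is passed explicitly because newer pandas versions may not skip the string `date` column on their own the way the session output above does.

```python
import pandas as pd

# Three rows taken from the table above; "date" is kept as plain strings.
quotes = pd.DataFrame({
    "date": ["2007-03-01", "2007-03-02", "2007-03-05"],
    "open": [22.074, 20.750, 20.300],
    "close": [20.657, 20.489, 19.593],
})

# Column-wise mean (axis=0 is the default), restricted to numeric columns.
col_means = quotes.mean(numeric_only=True)

# Row-wise sum (axis=1) over the two price columns.
row_sums = quotes[["open", "close"]].sum(axis=1)

# Sort by "close", breaking ties with "open"; ascending order is the default.
by_close = quotes.sort_values(["close", "open"])

# Sort by row labels in descending order.
by_index_desc = quotes.sort_index(ascending=False)

print(col_means)
print(by_close)
```

Both sort methods return a new object; the original `quotes` keeps its order unless the result is assigned back.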
    # apply(func, axis=0)  # apply a custom function to each column (axis=0) or each row (axis=1); func may return a scalar or a Series
    # applymap(func)       # apply a function to every element of a DataFrame
    # map(func)            # apply a function to every element of a Series

    In [108]: df2
    Out[108]:
         open   close    high     low      volume
    0  22.074  20.657  22.503  20.220  1977633.51
    1  20.750  20.489  20.944  20.256   425048.32
    2  20.300  19.593  20.384  19.218   419196.74
    3  19.426  19.977  20.308  19.315   297727.88
    4  19.995  20.520  20.706  19.827   287463.78
    5  20.353  20.273  20.454  20.167   130983.83
    6  20.264  20.101  20.353  19.735   160887.79
    7  19.999  19.739  19.999  19.646   145353.06
    8  19.783  19.818  19.982  19.699   102319.68
    9  19.558  19.841  19.911  19.333   173306.56

    In [110]: df2.apply(lambda x: x.sum())
    Out[110]:
    open          202.502
    close         201.008
    high          205.544
    low           197.416
    volume    4119921.150
    dtype: float64

    In [109]: df2.applymap(lambda x: x + 1)
    Out[109]:
         open   close    high     low      volume
    0  23.074  21.657  23.503  21.220  1977634.51
    1  21.750  21.489  21.944  21.256   425049.32
    2  21.300  20.593  21.384  20.218   419197.74
    3  20.426  20.977  21.308  20.315   297728.88
    4  20.995  21.520  21.706  20.827   287464.78
    5  21.353  21.273  21.454  21.167   130984.83
    6  21.264  21.101  21.353  20.735   160888.79
    7  20.999  20.739  20.999  20.646   145354.06
    8  20.783  20.818  20.982  20.699   102320.68
    9  20.558  20.841  20.911  20.333   173307.56
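The session above shows `apply` only along the default axis and with a scalar-returning function. The sketch below, on a made-up two-row frame `prices`, adds the row-wise and Series-returning cases. Recent pandas releases deprecate `applymap` in favor of an element-wise `DataFrame.map`, so the element-wise step here goes through `Series.map` on each column, which should behave the same across versions.

```python
import pandas as pd

# Hypothetical two-row frame using values from the table above.
prices = pd.DataFrame({"open": [22.074, 20.750], "close": [20.657, 20.489]})

# axis=0 (default): func receives each column as a Series and returns a scalar.
col_range = prices.apply(lambda col: col.max() - col.min())

# axis=1: func receives each row as a Series.
row_spread = prices.apply(lambda row: row["open"] - row["close"], axis=1)

# func may also return a Series; the results are assembled into a DataFrame.
summary = prices.apply(lambda col: pd.Series({"min": col.min(), "max": col.max()}))

# Element-wise on one Series.
open_plus_one = prices["open"].map(lambda x: x + 1)

# Element-wise on the whole frame, one column at a time via Series.map.
plus_one = prices.apply(lambda col: col.map(lambda x: x + 1))

print(summary)
print(plus_one)
```

For simple arithmetic such as adding 1, the vectorized form `prices + 1` gives the same result and is usually faster; the element-wise functions matter when the operation has no vectorized equivalent.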
pandas: hierarchical indexing
Hierarchical indexing is an important pandas feature: it lets a single axis carry more than one index level.
    # assumes `import pandas as pd` was run earlier in the session
    In [114]: import numpy as np
    In [115]: data = pd.Series(np.random.rand(9),
         ...:                  index=[['a','a','a','b','b','b','c','c','c'],
         ...:                         [1,2,3,1,2,3,1,2,3]])

    In [116]: data
    Out[116]:
    a  1    0.445620
       2    0.584242
       3    0.454314
    b  1    0.439814
       2    0.714734
       3    0.415314
    c  1    0.491325
       2    0.411385
       3    0.617076
    dtype: float64

    In [118]: data['a']
    Out[118]:
    1    0.445620
    2    0.584242
    3    0.454314
    dtype: float64

    In [119]: data['c']
    Out[119]:
    1    0.491325
    2    0.411385
    3    0.617076
    dtype: float64
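Beyond selecting by the outer level as above, a two-level index also allows selection by the inner level and reshaping. The following sketch uses deterministic values (0 to 8) instead of `np.random.rand` so the results are easy to follow. The `.loc` and `unstack` calls do not appear in the session above; they are included as an illustration.

```python
import pandas as pd

# Same two-level index as above, with predictable values 0..8.
data = pd.Series(
    list(range(9)),
    index=[list("aaabbbccc"), [1, 2, 3] * 3],
)

outer_a = data["a"]          # outer label -> Series indexed by the inner level
single = data.loc[("b", 2)]  # full (outer, inner) key -> a single value
inner_2 = data.loc[:, 2]     # inner label across every outer label
wide = data.unstack()        # inner level becomes the columns of a DataFrame

print(inner_2)
print(wide)
```

`wide.stack()` is the inverse operation and turns the frame back into a Series with a two-level index.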