pandas 数据结构的基本功能
操作Series和DataFrame中的数据的常用方法:
导入python库:
import numpy as np import pandas as pd
测试的数据结构:
Series:
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c']) >>> obj d 4.5 b 7.2 a -5.3 c 3.6 dtype: float64
DataFrame:
>>> data = { ... 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], ... 'year': [2000, 2001, 2002, 2001, 2002], ... 'pop': [1.5, 1.7, 3.6, 2.4, 2.9] ... } >>> frame = pd.DataFrame(data) >>> frame pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002
重新索引 reindex():
创建一个适应新索引的新对象:
对于Series来说,只有列索引(数据标签):
调用该Series的reindex将会根据新索引进行重排。如果某个索引值当前不存在,就引入缺失值
例:将 ['d', 'b', 'a', 'c'] 替换为 ['a', 'b', 'c', 'd', 'e'] e不存在 ,自动引入缺失值NaN,可以使用fill_value手动选择缺失值
>>> obj.reindex(['a', 'b', 'c', 'd', 'e']) a -5.3 b 7.2 c 3.6 d 4.5 e NaN dtype: float64 >>> obj.reindex(['a', 'b', 'c', 'd', 'e'],fill_value=666) a -5.3 b 7.2 c 3.6 d 4.5 e 666.0 dtype: float64
对于DataFrame来说,既有行索引也有列索引,默认是行索引,但也可同时进行重新索引(使用方法看例子和输出结果)。
例:需要注意的是,int和str的区别,默认的索引类型是int型,
>>> frame pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002 >>> frame.reindex([4,3,2,1,0]) pop state year 4 2.9 Nevada 2002 3 2.4 Nevada 2001 2 3.6 Ohio 2002 1 1.7 Ohio 2001 0 1.5 Ohio 2000 >>> frame.reindex(['4','3','2','1','0']) pop state year 4 NaN NaN NaN 3 NaN NaN NaN 2 NaN NaN NaN 1 NaN NaN NaN 0 NaN NaN NaN >>> frame.reindex(['a', 'b', 'c', 'd', 'e']) pop state year a NaN NaN NaN b NaN NaN NaN c NaN NaN NaN d NaN NaN NaN e NaN NaN NaN >>> frame.reindex([4,3,2,1,0],columns=['year', 'state', 'pop']) year state pop 4 2002 Nevada 2.9 3 2001 Nevada 2.4 2 2002 Ohio 3.6 1 2001 Ohio 1.7 0 2000 Ohio 1.5 >>> frame.reindex(index=[4,3,2,1,0],columns=['year', 'state', 'pop']) year state pop 4 2002 Nevada 2.9 3 2001 Nevada 2.4 2 2002 Ohio 3.6 1 2001 Ohio 1.7 0 2000 Ohio 1.5
删除指定行/列的项:
对于Series来说,只有列的概念:
>>> obj d 4.5 b 7.2 a -5.3 c 3.6 dtype: float64 >>> obj.drop(['d','a']) b 7.2 c 3.6 dtype: float64
对于DataFrame来说,既有行也有列,默认是删除行,删除列时设置axis为1, 否则会报错(使用方法看例子和输出结果)。
>>> frame pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002 >>> frame.drop([0,1]) pop state year 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002 >>> frame.drop(['pop']) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 2530, in drop obj = obj._drop_axis(labels, axis, level=level, errors=errors) File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 2562, in _drop_axis new_axis = axis.drop(labels, errors=errors) File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3744, in drop labels[mask]) ValueError: labels ['pop'] not contained in axis >>> frame.drop(['pop'],axis=1) state year 0 Ohio 2000 1 Ohio 2001 2 Ohio 2002 3 Nevada 2001 4 Nevada 2002
索引 ,选取,过滤:
Series:
选取:
series的选取类似于list;不同的是 series既可以使用数字索引选取,也可以使用自定标签索引选取。
>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c']) >>> obj d 4.5 b 7.2 a -5.3 c 3.6 dtype: float64 >>> obj['d'] 4.5 >>> obj[0] 4.5
赋值:赋值:
与选取类似。
>>> obj['d'] = 0 >>> obj['d'] 0.0 >>> obj d 0.0 b 7.2 a -5.3 c 3.6 dtype: float64 >>> obj[0] = 88 >>> obj d 88.0 b 7.2 a -5.3 c 3.6 dtype: float64
DataFrame:
选取:
DataFrame默认的索引指的是列索引,并且只能使用列标签索引,不能使用数字索引会报错(返回Series对象)。
DataFrame可以使用切片功能来进行 行索引选取(返回DataFrame对象)。
DataFrame也可以使用DataFrame.ix[val]来进行具体选取(返回Series对象)。使用方法:frame.ix[0]返回第一行的Series对象。frame.ix[1,['year']]返回第二行,第year列的Series对象。
例:列索引
>>> frame year state pop 0 2000 Ohio 1.5 1 2001 Ohio 1.7 2 2002 Ohio 3.6 3 2001 Nevada 2.4 4 2002 Nevada 2.9 >>> frame['year'] 0 2000 1 2001 2 2002 3 2001 4 2002 Name: year, dtype: int64 >>> frame[0] Traceback (most recent call last): File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc return self._engine.get_loc(key) File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 0
例:行索引
>>> frame year state pop 0 2000 Ohio 1.5 1 2001 Ohio 1.7 2 2002 Ohio 3.6 3 2001 Nevada 2.4 4 2002 Nevada 2.9 >>> frame[0:2] year state pop 0 2000 Ohio 1.5 1 2001 Ohio 1.7 >>> frame[0:1] year state pop 0 2000 Ohio 1.5 >>> frame.ix[0] year 2000 state Ohio pop 1.5
Name: 0, dtype: object
例:ix索引
>>> frame.ix[0] year 2000 state Ohio pop 1.5 Name: 0, dtype: object >>> frame.ix[1,['year']] year 2001 Name: 1, dtype: object
例:返回格式
>>> type(frame['year']) <class 'pandas.core.series.Series'> >>> type(frame[0:2]) <class 'pandas.core.frame.DataFrame'> >>> type(frame.ix[0]) <class 'pandas.core.series.Series'> >>> type(frame.ix[0,['year']]) <class 'pandas.core.series.Series'>
赋值:
例:DataFrame赋值
#frame >>> frame year state pop 0 2000 Ohio 1.5 1 2001 Ohio 1.7 2 2002 Ohio 3.6 3 2001 Nevada 2.4 4 2002 Nevada 2.9 #对frame列赋值非list是会对整列赋值 >>> frame['year'] = 5 >>> frame year state pop 0 5 Ohio 1.5 1 5 Ohio 1.7 2 5 Ohio 3.6 3 5 Nevada 2.4 4 5 Nevada 2.9 >>> frame['year'] = 'test' >>> frame year state pop 0 test Ohio 1.5 1 test Ohio 1.7 2 test Ohio 3.6 3 test Nevada 2.4 4 test Nevada 2.9 #对frame列赋值进行list整列赋值是必须保证list长度等于行的长度。 >>> frame['year'] = range(5) >>> frame year state pop 0 0 Ohio 1.5 1 1 Ohio 1.7 2 2 Ohio 3.6 3 3 Nevada 2.4 4 4 Nevada 2.9 >>> frame['year'] = range(4) Traceback (most recent call last): ValueError: Length of values does not match length of index #行赋值 >>> frame.ix[0] = 5 >>> frame year state pop 0 5 5 5.0 1 1 Ohio 1.7 2 2 Ohio 3.6 3 3 Nevada 2.4 4 4 Nevada 2.9
算术运算: