Basic functionality of pandas data structures

Common methods for working with the data in a Series and a DataFrame:

Import the Python libraries:

import numpy as np
import pandas as pd

The data structures used in the examples:

Series:

>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

DataFrame:

>>> data = {
...     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
...     'year': [2000, 2001, 2002, 2001, 2002],
...     'pop': [1.5, 1.7, 3.6, 2.4, 2.9]
... }
>>> frame = pd.DataFrame(data)
>>> frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

 

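Reindexing with reindex():

  reindex creates a new object that conforms to a new index:

  A Series has a single index (the labels of its values):

  Calling reindex on the Series rearranges the data according to the new index. If a label in the new index does not currently exist, a missing value is introduced for it.

  Example: replace ['d', 'b', 'a', 'c'] with ['a', 'b', 'c', 'd', 'e']. 'e' does not exist, so NaN is introduced automatically; fill_value lets you choose the value used for missing labels instead.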

>>> obj.reindex(['a', 'b', 'c', 'd', 'e'])
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
>>> obj.reindex(['a', 'b', 'c', 'd', 'e'],fill_value=666)
a     -5.3
b      7.2
c      3.6
d      4.5
e    666.0
dtype: float64
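
The point that reindex builds a new object rather than changing the original can be checked with a short script. This is a minimal sketch that reuses the obj Series from above; the names reindexed and filled, and the fill value 0, are chosen here for illustration only.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# reindex returns a new Series; obj keeps its original label order.
reindexed = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value replaces the NaN that would otherwise appear for new labels.
filled = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

print(list(obj.index))         # labels of the untouched original
print(reindexed.isna().sum())  # how many labels had no match
print(filled['e'])             # the value supplied through fill_value
```

To keep the reordered result, assign it back, for example obj = obj.reindex(...).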

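  A DataFrame has both a row index and a column index. reindex works on the row index by default, but both can be reindexed in the same call (see the examples and their output).

  Example: note the difference between int and str labels. The default index holds integers, so reindexing with the strings '4', '3', ... matches nothing and every row becomes NaN.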

>>> frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
>>> frame.reindex([4,3,2,1,0])
   pop   state  year
4  2.9  Nevada  2002
3  2.4  Nevada  2001
2  3.6    Ohio  2002
1  1.7    Ohio  2001
0  1.5    Ohio  2000
>>> frame.reindex(['4','3','2','1','0'])
   pop state  year
4  NaN   NaN   NaN
3  NaN   NaN   NaN
2  NaN   NaN   NaN
1  NaN   NaN   NaN
0  NaN   NaN   NaN
>>> frame.reindex(['a', 'b', 'c', 'd', 'e'])
   pop state  year
a  NaN   NaN   NaN
b  NaN   NaN   NaN
c  NaN   NaN   NaN
d  NaN   NaN   NaN
e  NaN   NaN   NaN
>>> frame.reindex([4,3,2,1,0],columns=['year', 'state', 'pop'])
   year   state  pop
4  2002  Nevada  2.9
3  2001  Nevada  2.4
2  2002    Ohio  3.6
1  2001    Ohio  1.7
0  2000    Ohio  1.5
>>> frame.reindex(index=[4,3,2,1,0],columns=['year', 'state', 'pop'])
   year   state  pop
4  2002  Nevada  2.9
3  2001  Nevada  2.4
2  2002    Ohio  3.6
1  2001    Ohio  1.7
0  2000    Ohio  1.5
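
The int-versus-str behaviour shown above can be verified programmatically. The sketch below rebuilds the same frame and compares three reindex calls; the variable names by_int, by_str and both are illustrative and not part of the original session.

```python
import pandas as pd

frame = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
})

# Integer labels match the default integer index, so rows are only reordered.
by_int = frame.reindex([4, 3, 2, 1, 0])

# String labels match nothing in an integer index, so every cell is missing.
by_str = frame.reindex(['4', '3', '2', '1', '0'])

# Rows and columns can be reindexed together with explicit keywords.
both = frame.reindex(index=[4, 3, 2, 1, 0], columns=['year', 'state', 'pop'])

print(by_int.index.tolist())
print(by_str.isna().all().all())
print(both.columns.tolist())
```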

Dropping entries from a row or column:

  A Series has only one axis, so drop takes index labels:

>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> obj.drop(['d','a'])
b    7.2
c    3.6
dtype: float64

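  A DataFrame has both rows and columns, and drop removes rows by default. To remove a column, pass axis=1; otherwise pandas looks for the label in the row index and raises an error (see the examples and their output).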

   

>>> frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
>>> frame.drop([0,1])
   pop   state  year
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
>>> frame.drop(['pop'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 2530, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/generic.py", line 2562, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3744, in drop
    labels[mask])
ValueError: labels ['pop'] not contained in axis
>>> frame.drop(['pop'],axis=1)
    state  year
0    Ohio  2000
1    Ohio  2001
2    Ohio  2002
3  Nevada  2001
4  Nevada  2002
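
The drop calls above can be condensed into a small script. It is a sketch under two assumptions: the columns= keyword is used as an equivalent spelling of axis=1 (available in reasonably recent pandas releases), and the exception raised for a missing row label is caught as either ValueError or KeyError, since the type differs between the pandas version shown in the traceback above and newer ones.

```python
import pandas as pd

frame = pd.DataFrame({
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
})

# Default axis is the row index: drop the rows labelled 0 and 1.
no_rows = frame.drop([0, 1])

# Two equivalent ways to drop a column.
no_pop = frame.drop(['pop'], axis=1)
no_pop_kw = frame.drop(columns=['pop'])

# Without axis=1, 'pop' is searched for among the row labels and is not found.
try:
    frame.drop(['pop'])
    error_name = None
except (KeyError, ValueError) as exc:
    error_name = type(exc).__name__

print(no_rows.index.tolist())
print(no_pop.columns.tolist())
print(error_name)
```

Like reindex, drop returns a new object, so frame itself still contains the 'pop' column afterwards.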

 

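Indexing, selection, and filtering:

  Series:

    Selection:

      Selecting from a Series works much like selecting from a list. The difference is that a Series accepts either an integer position or one of its own labels.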

>>> obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> obj['d']
4.5
>>> obj[0]
4.5

    Assignment:

      Works the same way as selection.

>>> obj['d'] = 0
>>> obj['d']
0.0
>>> obj
d    0.0
b    7.2
a   -5.3
c    3.6
dtype: float64
>>> obj[0] = 88
>>> obj
d    88.0
b     7.2
a    -5.3
c     3.6
dtype: float64
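
Using a bare integer such as obj[0] on a Series with string labels relies on a positional fallback, and newer pandas releases may warn about it. A more explicit sketch of the same selection and assignment uses .loc for labels and .iloc for positions; the variable names by_label and by_position are illustrative.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Selection: .loc takes a label, .iloc takes an integer position.
by_label = obj.loc['d']
by_position = obj.iloc[0]

# Assignment follows the same two access paths.
obj.loc['d'] = 0     # by label
obj.iloc[1] = 88     # by position (the element labelled 'b')

print(by_label, by_position)
print(obj.tolist())
print(obj.dtype)
```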

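  DataFrame:

    Selection:

      Plain [] indexing on a DataFrame selects columns by default, and only column labels work: an integer that is not a column label raises KeyError. Selecting a single column returns a Series.

      A DataFrame also accepts a slice inside [], which selects rows (and returns a DataFrame).

      DataFrame.ix[val] can select specific rows and cells (returning a Series): frame.ix[0] returns the first row as a Series, and frame.ix[1, ['year']] returns the 'year' entry of the second row as a Series. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0; in current versions .loc (label-based) and .iloc (position-based) take its place.

Example: column indexing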

>>> frame
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
>>> frame['year']
0    2000
1    2001
2    2002
3    2001
4    2002
Name: year, dtype: int64
>>> frame[0]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

Example: row indexing

>>> frame
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
>>> frame[0:2]
   year state  pop
0  2000  Ohio  1.5
1  2001  Ohio  1.7
>>> frame[0:1]
   year state  pop
0  2000  Ohio  1.5
>>> frame.ix[0]
year     2000
state    Ohio
pop       1.5
Name: 0, dtype: object
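
The column and row-slice behaviour above can be reproduced in a self-contained script. This sketch rebuilds frame with the column order shown in these examples; the flag int_key_failed is a hypothetical name used only to record whether frame[0] raised KeyError.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        'year': [2000, 2001, 2002, 2001, 2002],
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
    },
    columns=['year', 'state', 'pop'],
)

# A column label inside [] returns that column as a Series.
year = frame['year']

# An integer inside [] is treated as a column label, and 0 is not one.
try:
    frame[0]
    int_key_failed = False
except KeyError:
    int_key_failed = True

# A slice inside [] selects rows by position and returns a DataFrame.
first_two = frame[0:2]

print(type(year).__name__, type(first_two).__name__, int_key_failed)
```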

Example: indexing with ix

>>> frame.ix[0]
year     2000
state    Ohio
pop       1.5
Name: 0, dtype: object
>>> frame.ix[1,['year']]
year    2001
Name: 1, dtype: object
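
Because .ix is no longer available in pandas 1.0 and later, the two calls above would need to be rewritten there. The following sketch shows one possible translation, assuming the default integer index so that positions and labels coincide; first_row and second_year are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        'year': [2000, 2001, 2002, 2001, 2002],
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
    },
    columns=['year', 'state', 'pop'],
)

# Counterpart of frame.ix[0]: the first row by position, as a Series.
first_row = frame.iloc[0]

# Counterpart of frame.ix[1, ['year']]: row label 1, column list ['year'].
second_year = frame.loc[1, ['year']]

print(type(first_row).__name__, first_row['state'])
print(type(second_year).__name__, second_year['year'])
```

Passing the column as a list (['year']) is what makes the second result a one-element Series; passing the bare label 'year' would return the scalar value instead.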

Example: return types

>>> type(frame['year'])
<class 'pandas.core.series.Series'>


>>> type(frame[0:2])
<class 'pandas.core.frame.DataFrame'>


>>> type(frame.ix[0])
<class 'pandas.core.series.Series'>

>>> type(frame.ix[0,['year']])
<class 'pandas.core.series.Series'>

     Assignment:

Example: assigning to a DataFrame

# frame
>>> frame
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
# Assigning a non-list value (a scalar) to a column sets every row of that column
>>> frame['year'] = 5
>>> frame
   year   state  pop
0     5    Ohio  1.5
1     5    Ohio  1.7
2     5    Ohio  3.6
3     5  Nevada  2.4
4     5  Nevada  2.9
>>> frame['year'] = 'test'
>>> frame
   year   state  pop
0  test    Ohio  1.5
1  test    Ohio  1.7
2  test    Ohio  3.6
3  test  Nevada  2.4
4  test  Nevada  2.9

# Assigning a list-like to a column requires its length to equal the number of rows.
>>> frame['year'] = range(5)
>>> frame
   year   state  pop
0     0    Ohio  1.5
1     1    Ohio  1.7
2     2    Ohio  3.6
3     3  Nevada  2.4
4     4  Nevada  2.9
>>> frame['year'] = range(4)
Traceback (most recent call last):
ValueError: Length of values does not match length of index



# Row assignment
>>> frame.ix[0] = 5
>>> frame
   year   state  pop
0     5       5  5.0
1     1    Ohio  1.7
2     2    Ohio  3.6
3     3  Nevada  2.4
4     4  Nevada  2.9
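
The assignment rules above can be collected into one script that runs on current pandas. It is a sketch with two deliberate departures from the session above: .loc is used in place of .ix, and the row assignment is restricted to the numeric columns 'year' and 'pop' so that the string column 'state' keeps its type. The flag length_mismatch is a hypothetical name.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        'year': [2000, 2001, 2002, 2001, 2002],
        'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
    },
    columns=['year', 'state', 'pop'],
)

# A scalar is broadcast to every row of the column.
frame['year'] = 5
scalar_column = frame['year'].tolist()

# A list-like must have exactly one value per row.
frame['year'] = range(5)
try:
    frame['year'] = range(4)
    length_mismatch = False
except ValueError:
    length_mismatch = True

# Row assignment by label, limited here to the numeric columns.
frame.loc[0, ['year', 'pop']] = 5

print(scalar_column)
print(length_mismatch)
print(frame.loc[0].tolist())
```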

 

 

 

 Arithmetic operations:

 

      

posted @ 2017-12-27 09:11  Jansora