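Python for Data Analysis - 05 - pandas Basics
1. Series
A Series is a one-dimensional array-like object made up of a set of data and an associated set of data labels (its index). The simplest Series can be created from a set of data alone:
import pandas as pd
p=pd.Series([1,2,4,3])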
p
Out[5]:
0 1
1 2
2 4
3 3
dtype: int64
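The index is displayed on the left and the values on the right. The values and index attributes return the array of values and the index object, respectively: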
p.values
Out[6]: array([1, 2, 4, 3], dtype=int64)
p.index
Out[7]: Int64Index([0, 1, 2, 3], dtype='int64')
A single value or a set of values in a Series can be selected by index:
p[0]
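Selection by index is easier to see with string labels. The following is a minimal sketch; the Series s and its labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical example data; the labels are chosen only for illustration.
s = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

single = s["a"]              # one value, selected by its label
subset = s[["c", "a", "d"]]  # several values, selected by a list of labels

print(single)
print(subset)
```

Passing a list of labels returns a new Series whose rows follow the order of the list.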
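NumPy-style array operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, preserve the link between index and values (shown here with obj2):
obj2 = pd.Series([4,7,-5,3],index=['a','b','c','d'])
obj2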
Out[22]:
a 4
b 7
c -5
d 3
dtype: int64
obj2[obj2>0]
Out[26]:
a 4
b 7
d 3
dtype: int64
obj2 * 2
Out[27]:
a 8
b 14
c -10
d 6
dtype: int64
import numpy as np
np.exp(obj2)
Out[29]:
a 54.598150
b 1096.633158
c 0.006738
d 20.085537
dtype: float64
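That these operations keep the index attached to the values can be checked directly. A minimal sketch, assuming obj2 holds the values shown in the output above; the names doubled, positive, same_labels, has_b and has_e are introduced only for this illustration.

```python
import pandas as pd

# Assumed definition of obj2, matching the output shown above.
obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

doubled = obj2 * 2          # element-wise arithmetic keeps the labels
positive = obj2[obj2 > 0]   # boolean filtering keeps the labels of the kept rows

same_labels = doubled.index.equals(obj2.index)

# A Series also behaves like a fixed-length ordered dict:
# the "in" operator tests the index labels, not the values.
has_b = "b" in obj2
has_e = "e" in obj2

print(same_labels, list(positive.index), has_b, has_e)
```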
If the data is stored in a dict, a Series can be created from it directly:
sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3 = pd.Series(sdata)
obj3
Out[34]:
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
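If only a dict is passed, the index of the resulting Series is the dict's keys. If an index is passed as well, values are matched to it by key, and labels without a matching key get NaN: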
states = ['California','Ohio','Oregon','Texas']
sdata = {'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj4 = pd.Series(sdata,index = states)
obj4
Out[12]:
California NaN
Ohio 35000
Oregon 16000
Texas 71000
dtype: float64
The pandas functions isnull and notnull can be used to detect missing data:
pd.isnull(obj4)
Out[13]:
California True
Ohio False
Oregon False
Texas False
dtype: bool
pd.notnull(obj4)
Out[14]:
California False
Ohio True
Oregon True
Texas True
dtype: bool
Likewise, the instance method obj4.isnull() gives the same result.
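The method forms can be combined with ordinary Series operations, for example to count missing entries or to keep only the present ones. A minimal sketch that reuses sdata and states from above; the names missing_mask, n_missing and present are introduced only for this illustration.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

missing_mask = obj4.isnull()         # same result as pd.isnull(obj4)
n_missing = int(missing_mask.sum())  # True counts as 1 when summed
present = obj4[obj4.notnull()]       # keep only the rows that have a value

print(n_missing)
print(present)
```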
Both the Series itself and its index have a name attribute:
obj4.name = 'population'
obj4.index.name = 'state'
obj4
Out[18]:
state
California NaN
Ohio 35000
Oregon 16000
Texas 71000
Name: population, dtype: float64
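2. DataFrame: a tabular data structure
A DataFrame contains an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, etc.). It has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.
A DataFrame is made up of ordered columns, and every column must contain the same number of entries.
Constructing a DataFrame: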
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}
frame = pd.DataFrame(data)
frame
Out[41]:
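   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
The DataFrame gets an index automatically (just like a Series). In the pandas version used here the columns are placed in sorted order; newer versions keep the insertion order of the dict instead.
A sequence of columns can be given explicitly, so that the columns are arranged in that specific order: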
pd.DataFrame(data,columns=['year','state','pop'])
Out[46]:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
As with a Series, if a requested column is not found in the data, it is filled with NA values.
frame2 = pd.DataFrame(data,columns = ['year','state','pop','debt'],
index = ['one','two','three','four','five'])
frame2
Out[6]:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
frame2.columns
Out[7]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
A column of a DataFrame can be retrieved as a Series either with dict-like notation or as an attribute:
frame2['state']
Out[8]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
frame2.state
Out[9]:
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
Name: state, dtype: object
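The returned Series has the same index as the original DataFrame, and its name attribute is set accordingly. Rows can also be retrieved by position or by name, for example with the loc indexer (the ix indexer used in older pandas versions was removed in pandas 1.0):
data = {'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]}
frame2 = pd.DataFrame(data,columns = ['year','state','pop','debt'],
index = ['one','two','three','four','five'])
frame2.loc['three'] # retrieve the data of one row by its label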
Out[7]:
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
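Row access by name and by position can be compared side by side. A minimal sketch, assuming a current pandas version in which loc selects by label and iloc selects by integer position; the names by_label and by_position are introduced only for this illustration.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])

by_label = frame2.loc["three"]   # row selected by its index label
by_position = frame2.iloc[2]     # row selected by its integer position (0-based)

print(by_label)
print(by_position)
```

Both calls return the same row as a Series whose index is the DataFrame's column labels.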
frame2['debt'] = 16.5 # set every value in the debt column
frame2
Out[9]:
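       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
When a list or array is assigned to a column, its length must match the length of the DataFrame. If a Series is assigned, its labels are aligned exactly with the DataFrame's index, and any gaps are filled with missing values:
frame2 = pd.DataFrame(data,columns = ['year','state','pop','debt'],
index = ['one','two','three','four','five'])
val = pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt']=val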
frame2
Out[16]:
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
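The difference between assigning a list and assigning a Series can be sketched as follows. The values of val are taken from the output above; the name length_error is introduced only for this illustration, and the exact error message may differ between pandas versions.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])

# A list is assigned by position, so its length must equal the number of rows.
try:
    frame2["debt"] = [1.0, 2.0, 3.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on its labels; rows without a matching label get NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val

print(length_error)
print(frame2)
```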
frame2['eastern'] = frame2.state == 'Ohio'
# assigning to a column that does not exist creates a new column
frame2
Out[18]:
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
del frame2['eastern']
# the del keyword deletes a column
frame2.columns
Out[20]: Index(['year', 'state', 'pop', 'debt'], dtype='object')
3. Nested dicts
pop={'Nevada':{2001:2.4,2002:2.9},
'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
If the dict above is passed to DataFrame, the outer dict keys are interpreted as the columns and the inner keys as the row index:
frame3 = pd.DataFrame(pop)
frame3
Out[23]:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
The result can be transposed:
frame3.T
Out[24]:
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The keys of the inner dicts are combined and sorted to form the final index. This does not happen if an index is specified explicitly.
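The case of an explicitly specified index can be sketched as follows. The index values [2001, 2002, 2003] and the name frame_explicit are chosen only for this illustration.

```python
import pandas as pd

pop = {"Nevada": {2001: 2.4, 2002: 2.9},
       "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# With an explicit index, only the requested labels appear, in the given order.
# The label 2000 is dropped, and 2003 has no data, so its row is all NaN.
frame_explicit = pd.DataFrame(pop, index=[2001, 2002, 2003])

print(frame_explicit)
```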
If the name attributes of a DataFrame's index and columns are set, they are displayed as well:
frame3.index.name = 'year';frame3.columns.name = 'state'
frame3
Out[27]:
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
As with a Series, the values attribute returns the data in the DataFrame as a two-dimensional ndarray:
frame3.values
Out[28]:
array([[ nan, 1.5],
[ 2.4, 1.7],
[ 2.9, 3.6]])
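4. Index objects
pandas Index objects are responsible for holding the axis labels and other metadata (such as the axis name). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index: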
obj=pd.Series(range(3),index = ['a','b','c'])
index = obj.index
index
Out[33]: Index(['a', 'b', 'c'], dtype='object')
index[1:]
Out[34]: Index(['b', 'c'], dtype='object')
Index objects are immutable, so they cannot be modified.
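The immutability of an Index can be observed by trying to assign to one of its elements. A minimal sketch; the name error is introduced only for this illustration, and the exact error message may vary between pandas versions.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# Item assignment on an Index is not supported and raises a TypeError.
try:
    index[1] = "d"
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
print(list(index))  # the labels are unchanged
```

Because an Index cannot change in place, it is safe to share one Index object among several data structures.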