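These notes are based on:
Python for Data Analysis, 2nd Edition
That is the book.
Studying it purely in English is tiring; whether I got things right depends on Baidu Translate.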
Previously:
Post collecting various methods:
https://www.cnblogs.com/baili-luoyun/p/10250177.html
Summary: this post mainly covers the basics of getting started with pandas.
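I: pandas has two main data structures
Series, DataFrame
II: Series
1: Definition
A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels, called its index.
2: Representation
The string representation of a Series shows the index on the left and the values on the right.
3: Creating a one-dimensional Series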
import pandas as pd
obj = pd.Series([4, 5, 6, 7, 8])  # create a one-dimensional Series
print(obj)
print(obj.index)
print(obj.values)
>>>>
0    4
1    5
2    6
3    7
4    8
dtype: int64
RangeIndex(start=0, stop=5, step=1)
[4 5 6 7 8]
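To make the two attributes above a bit more concrete, here is a minimal sketch (variable names `values` and `labels` are my own) checking that `.values` hands back a NumPy array and that the default index simply counts up from 0.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 5, 6, 7, 8])

# .values exposes the underlying data as a NumPy array
values = obj.values
# with no index given, the labels are the integers 0..n-1
labels = list(obj.index)

print(isinstance(values, np.ndarray))  # True
print(labels)                          # [0, 1, 2, 3, 4]
```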
4: Accessing values by index
1>: Single label
obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])  # set a custom index
print(obj1)
# get a value by its index label
print(obj1['d'])
>>>>
d    4
a    6
b   -7
c   -8
dtype: int64
4
2>: Multiple labels
# pass a list of labels to select several values at once
print(obj1[['d', 'a', 'c']])
>>>>
d    4
a    6
c   -8
dtype: int64
3>: Boolean filtering
print(obj1[obj1 < 0])
>>>>
b   -7
c   -8
dtype: int64
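The filtering above works in two steps, which this small sketch spells out (the names `mask` and `negatives` are mine): the comparison first produces a boolean Series with the same index, and that Series is then used to pick rows.

```python
import pandas as pd

obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])

mask = obj1 < 0         # boolean Series sharing obj1's index
negatives = obj1[mask]  # keeps only the entries where mask is True

print(mask.tolist())          # [False, False, True, True]
print(list(negatives.index))  # ['b', 'c']
```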
4>: Multiplication
print(obj1 * 2)
>>>>
d     8
a    12
b   -14
c   -16
dtype: int64
5>: Applying element-wise functions
import numpy as np
print(np.exp(obj1))
>>>>
d     54.598150
a    403.428793
b      0.000912
c      0.000335
dtype: float64
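One thing worth checking for both the multiplication and the NumPy function: the result is still a Series, and the link between labels and values is preserved. A rough sketch, with `doubled` and `exped` as made-up names:

```python
import numpy as np
import pandas as pd

obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])

doubled = obj1 * 2    # element-wise arithmetic
exped = np.exp(obj1)  # NumPy function applied element-wise

# both results keep the original labels
print(doubled.index.equals(obj1.index))  # True
print(exped.index.equals(obj1.index))    # True
print(doubled['b'])                      # -14
```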
6>: The index as a mapping (membership tests)
# like a dict, `in` checks the index labels
print('b' in obj1)
print('e' in obj1)
>>>>
True
False
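Since `in` behaves like it does on a dict, it looks at the labels and not at the values. The following sketch illustrates the difference; testing against `.values` is just one way I would check the data itself.

```python
import pandas as pd

obj1 = pd.Series([4, 6, -7, -8], index=['d', 'a', 'b', 'c'])

label_hit = 'b' in obj1       # 'b' is an index label
value_as_key = 4 in obj1      # 4 is a value, not a label
value_hit = 4 in obj1.values  # test the data explicitly

print(label_hit)     # True
print(value_as_key)  # False
print(value_hit)     # True
```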
5: Creating a Series from a dict
1>: Building the Series from a dict
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)  # the dict keys become the index
print(obj3)
>>>>
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
2>: Passing both an index and the values to Series
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)
# pass an explicit index together with the values
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
>>>>
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
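Three things happen in the output above, and this sketch checks each of them: a label with no matching dict key ('California') gets NaN, a dict key not listed in the index ('Utah') is dropped, and the dtype becomes float64 because NaN is a floating-point value.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

print(pd.isnull(obj4['California']))  # True: no value was found
print('Utah' in obj4)                 # False: not in the requested index
print(obj4.dtype)                     # float64
```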
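3>: Detecting missing data
is_missing = pd.isnull(obj4)  # True where the value is NaN
print(is_missing)
not_missing = pd.notnull(obj4)  # True where a value is present
print(not_missing)
>>>>
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool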
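The same checks also exist as methods on the Series, and since they return boolean Series they combine with the boolean filtering from earlier. A small sketch, assuming the same `obj4` (the names `missing_labels` and `present` are mine):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# method forms of pd.isnull / pd.notnull, used as boolean filters
missing_labels = list(obj4[obj4.isnull()].index)
present = obj4[obj4.notnull()]

print(missing_labels)  # ['California']
print(len(present))    # 3
```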
4>: Assigning names
obj4.name = 'population'    # name of the Series itself
obj4.index.name = 'state'   # name of the index
print(obj4)
>>>>
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
5>: Changing the index labels by assignment
obj = pd.Series([4, 7, -6, 3])
print(obj)
obj.index = ['bob', 'Steve', 'jeff', 'Ryan']  # replace the labels in place
print(obj)
>>>>
0    4
1    7
2   -6
3    3
dtype: int64
bob      4
Steve    7
jeff    -6
Ryan     3
dtype: int64
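One caveat I ran into when assigning an index this way: the new list must have exactly as many labels as the Series has values. A minimal sketch of what happens otherwise (the three-label list is a deliberately wrong example of mine):

```python
import pandas as pd

obj = pd.Series([4, 7, -6, 3])
obj.index = ['bob', 'Steve', 'jeff', 'Ryan']
first = obj['bob']  # label lookup works after the assignment

try:
    obj.index = ['only', 'three', 'labels']  # 3 labels for 4 values
    error = None
except ValueError as exc:
    error = exc  # pandas reports a length mismatch

print(first)                          # 4
print(isinstance(error, ValueError))  # True
```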
III: DataFrame
1: Definition
# build a DataFrame from a dict of equal-length lists
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]}
frame = pd.DataFrame(data)
print(frame)
>>>>
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.8
5  Nevada  2003  3.2
# head() returns only the first five rows
print(frame.head())
>>>>
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.8
# columns= arranges the columns in the given order
print(pd.DataFrame(data, columns=['year', 'pop', 'state']))
>>>>
   year  pop   state
0  2000  1.5    Ohio
1  2001  1.7    Ohio
2  2002  3.6    Ohio
3  2001  2.4  Nevada
4  2002  2.8  Nevada
5  2003  3.2  Nevada
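As a quick check of the two calls above, the sketch below passes an explicit row count to `head` (it defaults to 5) and confirms that `columns=` controls the column order. The name `first_three` is my own.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]}

frame = pd.DataFrame(data, columns=['year', 'pop', 'state'])
first_three = frame.head(3)  # first 3 rows instead of the default 5

print(list(frame.columns))  # ['year', 'pop', 'state']
print(len(first_three))     # 3
```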
2.3: A requested column that is not found in the data is filled with missing values (NaN)
# a column that does not exist in the data is filled with NaN
# columns: column labels
# index: row labels
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
print(frame2)
>>>>
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.8  NaN
six    2003  Nevada  3.2  NaN
2.4: Getting the column labels
print(frame2.columns)
>>>>
Index(['year', 'state', 'pop', 'debt'], dtype='object')
2.5: Retrieving a single column by label or by attribute
# retrieve a single column
print(frame2['state'])
print(frame2.year)
print('>>>>>>>>>>>>>>>>>>')
print(frame2['year'])
>>>>
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
>>>>>>>>>>>>>>>>>>
one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64
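Attribute access is convenient, but it only works when the column name does not collide with something a DataFrame already has. This data happens to contain such a case: `pop` is also a DataFrame method, so `frame2.pop` is not the column. A small sketch of the difference:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])

same_column = frame2.year.equals(frame2['year'])  # both forms agree here
pop_attr_is_method = callable(frame2.pop)         # the method, not the column
pop_column = frame2['pop']                        # brackets always give the column

print(same_column)          # True
print(pop_attr_is_method)   # True
print(pop_column.tolist())  # [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]
```

Because of this, the bracket form is the safer habit when the column names are not known in advance.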
2.6: Retrieving a whole row with the loc attribute
print(frame2.loc['three'])
>>>>
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
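`loc` selects by label. For comparison, this sketch also uses `iloc`, which is not covered in the notes above and selects by integer position; on this frame, the row labelled 'three' sits at position 2, so both return the same row. The 'debt' column is left out here to keep the comparison simple.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

by_label = frame2.loc['three']  # select the row by its label
by_position = frame2.iloc[2]    # select the row by its position (0-based)

print(by_label['state'])    # Ohio
print(by_position['year'])  # 2002
print(by_label.name)        # three
```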
2.7: Modifying a column by assignment
frame2['debt'] = 16.5  # a scalar is broadcast to every row
print(frame2)
>>>>
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.8  16.5
six    2003  Nevada  3.2  16.5
2.8: Assigning values generated from a range
import numpy as np
frame2['debt'] = np.arange(6.)  # the array length must match the number of rows
print(frame2)
>>>>
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.8   4.0
six    2003  Nevada  3.2   5.0
2.9: Assigning a Series
print(frame2)
print(">>>>>>>>>>>>")
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
print(val)
print(">>>>>>>>>>>>>>")
frame2['debt'] = val  # labels are aligned to frame2's index; the rest become NaN
print(frame2)
>>>>
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.8  NaN
six    2003  Nevada  3.2  NaN
>>>>>>>>>>>>
two    -1.2
four   -1.5
five   -1.7
dtype: float64
>>>>>>>>>>>>>>
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.8  -1.7
six    2003  Nevada  3.2   NaN
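The difference between assigning a Series and assigning a plain list is worth seeing side by side. In this sketch, the Series is matched to the rows by label and the gaps are filled with NaN, while a list of the wrong length is rejected. The short list is a deliberately wrong example of mine.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

# a Series is aligned by label; unmatched rows get NaN
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
missing = int(frame2['debt'].isnull().sum())

# a plain list has no labels, so its length must equal the row count
try:
    frame2['debt'] = [-1.2, -1.5, -1.7]
    error = None
except ValueError as exc:
    error = exc

print(missing)                        # 3
print(frame2.loc['four', 'debt'])     # -1.5
print(isinstance(error, ValueError))  # True
```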
2.10: Creating a column from a boolean expression
frame2['eastern'] = frame2.state == 'Ohio'  # True where state is 'Ohio'
print(frame2)
>>>>
       year   state  pop debt  eastern
one    2000    Ohio  1.5  NaN     True
two    2001    Ohio  1.7  NaN     True
three  2002    Ohio  3.6  NaN     True
four   2001  Nevada  2.4  NaN    False
five   2002  Nevada  2.8  NaN    False
six    2003  Nevada  3.2  NaN    False
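A boolean column like this is mostly useful as a filter. The sketch below, with `ohio_rows` as a name of my own, uses it to select rows and then removes the helper column again with `del`.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.8, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

frame2['eastern'] = frame2['state'] == 'Ohio'
ohio_rows = frame2[frame2['eastern']]  # keep rows where the flag is True
del frame2['eastern']                  # drop the helper column again

print(list(ohio_rows.index))  # ['one', 'two', 'three']
print(list(frame2.columns))   # ['year', 'state', 'pop', 'debt']
```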