Pandas Study Notes
import numpy as np  # needed by the np.arange examples in these notes
import pandas as pd
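Two data types: Series and DataFrame
pandas is an extension library built on NumPy that provides the tools needed to work efficiently with large datasets.
A Series consists of a set of data values and the index associated with them.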
In [4]: d=pd.Series(range(5))  # automatic index

In [5]: d
Out[5]:
0    0
1    1
2    2
3    3
4    4
dtype: int64

In [6]: d=pd.Series(range(5),index=['a','b','c','d','e'])  # custom index

In [7]: d
Out[7]:
a    0
b    1
c    2
d    3
e    4
dtype: int64
Pass a dict directly:

In [10]: d=pd.Series({'a':1,'b':2,'c':3})

In [11]: d
Out[11]:
a    1
b    2
c    3
dtype: int64
Or create one from an ndarray:

In [16]: d=pd.Series(np.arange(5),index=np.arange(9,4,-1))

In [17]: d
Out[17]:
9    0
8    1
7    2
6    3
5    4
dtype: int64
.index returns the index; .values returns the data values.
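As a small illustration of the two attributes, the sketch below builds a Series and reads its labels and data separately. The variable names and sample values are made up for the example.

```python
import pandas as pd

# A small Series with a custom index (values chosen only for illustration).
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

index_labels = list(s.index)   # Index object converted to a plain list
data_values = s.values         # the underlying NumPy array

print(index_labels)            # ['a', 'b', 'c']
print(data_values)             # [10 20 30]
print(s['b'])                  # lookup by label -> 20
print(s.iloc[1])               # lookup by position -> 20
```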
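A DataFrame consists of a set of columns that share the same index. It is a tabular data type with both a row index and a column index, and is commonly used to represent two-dimensional data.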
1. Create from a two-dimensional ndarray, or from a dict of one-dimensional objects (Series here):

In [40]: d=pd.DataFrame(np.arange(20).reshape(4,5))

In [41]: d
Out[41]:
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

In [43]: dt={'one':pd.Series([1,2,3],index=['a','b','c']),
    ...:     'two':pd.Series([9,8,7,6],index=['a','b','c','d'])}

In [45]: d=pd.DataFrame(dt)

In [46]: d
Out[46]:
   one  two
a  1.0    9
b  2.0    8
c  3.0    7
d  NaN    6

In [47]: pd.DataFrame(dt,index=['b','c','d'],columns=['two','three'])
Out[47]:
   two three
b    8   NaN
c    7   NaN
d    6   NaN
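The NaN and the 1.0/2.0/3.0 values in the output above come from index alignment: the Series are matched by label, and a label missing from one Series leaves a hole. The following sketch repeats the same construction and checks this. The variable names are illustrative only.

```python
import pandas as pd

one = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
two = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])

df = pd.DataFrame({'one': one, 'two': two})

# The row index is the union of the two Series indexes.
row_labels = list(df.index)

# 'one' has no entry for 'd', so that cell is NaN and the column is stored as float.
missing_d = bool(pd.isna(df.loc['d', 'one']))
one_dtype = str(df['one'].dtype)
two_dtype = str(df['two'].dtype)

print(row_labels)   # ['a', 'b', 'c', 'd']
print(missing_d)    # True
print(one_dtype)    # float64
```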
2. Create from a dict of lists:

In [50]: dt={'one':[1,2,3,4],'two':[9,8,7,6]}

In [51]: d=pd.DataFrame(dt,index=['a','b','c','d'])

In [52]: d
Out[52]:
   one  two
a    1    9
b    2    8
c    3    7
d    4    6
Reindexing with .reindex():

In [2]: d1={'name':['Alice','Bob','Tony'],
   ...:     'gender':['f','m','m'],
   ...:     'age':[18,20,25]}

In [5]: d=pd.DataFrame(d1,index=['c1','c2','c3'])  # older pandas sorted dict keys into column order; pandas 0.23+ on Python 3.6+ keeps insertion order

In [6]: d
Out[6]:
    age gender   name
c1   18      f  Alice
c2   20      m    Bob
c3   25      m   Tony

In [7]: d=d.reindex(['c3','c2','c1'])

In [8]: d
Out[8]:
    age gender   name
c3   25      m   Tony
c2   20      m    Bob
c1   18      f  Alice

In [9]: d=d.reindex(columns=['name','gender','age'])

In [10]: d
Out[10]:
     name gender  age
c3   Tony      m   25
c2    Bob      m   20
c1  Alice      f   18
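One behaviour the session above does not show is what happens when .reindex() is given a label that does not exist. A minimal sketch, assuming a made-up label 'c4' that is not in the original data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Tony'], 'age': [18, 20, 25]},
                  index=['c1', 'c2', 'c3'])

# Reorder the rows and request a label ('c4', hypothetical) that is not present.
reordered = df.reindex(['c3', 'c1', 'c4'])

print(list(reordered.index))             # ['c3', 'c1', 'c4']
print(reordered.loc['c4'].isna().all())  # True: the new row is filled with missing values
print(list(df.index))                    # ['c1', 'c2', 'c3']: the original is unchanged
```

Because .reindex() returns a new object, the sessions in these notes assign the result back to the variable.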
Common methods of the Index type:

In [11]: new1=d.columns.insert(3,'birthday')

In [12]: new1
Out[12]: Index([u'name', u'gender', u'age', u'birthday'], dtype='object')

In [17]: newd=d.reindex(columns=new1,fill_value='0101')

In [18]: newd
Out[18]:
     name gender  age birthday
c3   Tony      m   25     0101
c2    Bob      m   20     0101
c1  Alice      f   18     0101
In [29]: newd.drop('c1')  # drop vs. delete: drop removes by label
Out[29]:
    name gender  age birthday
c3  Tony      m   25     0101
c2   Bob      m   20     0101

In [32]: n=newd.index.delete(2)  # Index.delete removes by integer position

In [33]: newd=newd.reindex(index=n)

In [34]: newd
Out[34]:
    name gender  age birthday
c3  Tony      m   25     0101
c2   Bob      m   20     0101
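The two routes above reach the same result by different means. The sketch below puts them side by side on a cut-down copy of the table (only two columns, chosen for brevity).

```python
import pandas as pd

df = pd.DataFrame({'name': ['Tony', 'Bob', 'Alice'], 'age': [25, 20, 18]},
                  index=['c3', 'c2', 'c1'])

# DataFrame.drop removes rows (or columns) by label and returns a new DataFrame.
dropped = df.drop('c1')

# Index.delete removes the entry at an integer position and returns a new Index,
# which can then be passed to reindex to select the remaining rows.
shorter_index = df.index.delete(2)   # position 2 holds the label 'c1'
reindexed = df.reindex(index=shorter_index)

print(list(dropped.index))        # ['c3', 'c2']
print(list(shorter_index))        # ['c3', 'c2']
print(dropped.equals(reindexed))  # True
```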
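.sort_index(axis=0, ascending=True) sorts by index labels, ascending by default.
.sort_values() sorts by data values; for a DataFrame, the by argument names the column (or columns) to sort on.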
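A short sketch of both sorting methods on a small DataFrame. The data and the deliberately unordered row labels are chosen for illustration.

```python
import numpy as np
import pandas as pd

# 4x5 table of 0..19 with row labels in a scrambled order.
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'b', 'a', 'd'])

by_row_label = b.sort_index()                              # rows ordered a, b, c, d
by_col_label_desc = b.sort_index(axis=1, ascending=False)  # columns ordered 4, 3, 2, 1, 0
by_column_2_desc = b.sort_values(by=2, ascending=False)    # rows ordered by the values in column 2

print(list(by_row_label.index))         # ['a', 'b', 'c', 'd']
print(list(by_col_label_desc.columns))  # [4, 3, 2, 1, 0]
print(list(by_column_2_desc.index))     # ['d', 'a', 'b', 'c']
```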
Basic statistical analysis functions:

In [7]: b
Out[7]:
    0   1   2   3   4
c   0   1   2   3   4
b   5   6   7   8   9
a  10  11  12  13  14
d  15  16  17  18  19

In [8]: b.describe()
Out[8]:
               0          1          2          3          4
count   4.000000   4.000000   4.000000   4.000000   4.000000
mean    7.500000   8.500000   9.500000  10.500000  11.500000
std     6.454972   6.454972   6.454972   6.454972   6.454972
min     0.000000   1.000000   2.000000   3.000000   4.000000
25%     3.750000   4.750000   5.750000   6.750000   7.750000
50%     7.500000   8.500000   9.500000  10.500000  11.500000
75%    11.250000  12.250000  13.250000  14.250000  15.250000
max    15.000000  16.000000  17.000000  18.000000  19.000000

In [9]: type(b.describe())
Out[9]: pandas.core.frame.DataFrame

In [10]: b.describe().loc['max']  # .ix was removed in pandas 1.0; .loc selects by label
Out[10]:
0    15.0
1    16.0
2    17.0
3    18.0
4    19.0
Name: max, dtype: float64

In [11]: b.describe()[2]
Out[11]:
count     4.000000
mean      9.500000
std       6.454972
min       2.000000
25%       5.750000
50%       9.500000
75%      13.250000
max      17.000000
Name: 2, dtype: float64
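Besides .describe(), individual statistics can be computed directly. The sketch below rebuilds the same table as b and uses .mean() and .sum(), two standard pandas methods that these notes do not otherwise cover, alongside label-based selection from the summary.

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['c', 'b', 'a', 'd'])

col_means = b.mean()          # one mean per column (axis=0 is the default)
row_sums = b.sum(axis=1)      # one sum per row

summary = b.describe()        # the summary is itself a DataFrame
max_row = summary.loc['max']  # select a summary row by label
col_2 = summary[2]            # select a summary column

print(col_means.tolist())     # [7.5, 8.5, 9.5, 10.5, 11.5]
print(row_sums['c'])          # 10
print(col_2['mean'])          # 9.5
```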
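Correlation between data:
Covariance:
For two variables X and Y:
if their covariance > 0, X and Y are positively correlated;
if their covariance < 0, X and Y are negatively correlated;
if their covariance = 0, X and Y are linearly uncorrelated (this alone does not imply that they are independent).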
.cov() computes the covariance (for a DataFrame, the pairwise covariance matrix of its columns).
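The three sign cases can be checked on small hand-made Series. All data below is invented for illustration. pandas computes the sample covariance (dividing by n - 1).

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y_up = pd.Series([2, 4, 6, 8, 10])    # rises as x rises
y_down = pd.Series([10, 8, 6, 4, 2])  # falls as x rises

cov_up = x.cov(y_up)       # positive
cov_down = x.cov(y_down)   # negative

# A symmetric case: y is computed from x, yet the covariance is zero.
x_sym = pd.Series([-2, -1, 0, 1, 2])
y_sq = x_sym ** 2
cov_sq = x_sym.cov(y_sq)

# DataFrame.cov() returns the pairwise matrix; the diagonal holds the variances.
cov_matrix = pd.DataFrame({'x': x, 'y_up': y_up}).cov()

print(cov_up)    # 5.0
print(cov_down)  # -5.0
print(cov_sq)    # 0.0
```

The last case has zero covariance even though y_sq is fully determined by x_sym, so a zero covariance only rules out a linear relationship.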
Pearson correlation coefficient:
.corr() computes the correlation coefficient; Pearson is the default method.
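The Pearson coefficient is the covariance divided by the product of the two standard deviations, which keeps it between -1 and 1. A minimal sketch on invented data that compares .corr() with that manual calculation:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

r = x.corr(y)                              # Pearson is the default method
r_manual = x.cov(y) / (x.std() * y.std())  # covariance scaled by both standard deviations

r_self = x.corr(x)                         # a Series correlates perfectly with itself

print(round(r, 4))                  # 0.7746
print(abs(r - r_manual) < 1e-12)    # True
```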