pandas
numpy处理数值,但是pandas除了处理数值之外(基于numpy),还能处理其他类型数据。 pandas常用数据类型 Series一维,带标签数组 DataFrame二维,Series容器
Series创建
In [97]: import pandas as pd In [98]: pd.Series([1,2,3,4,5]) # 第一列对象键(index, 索引),第二列对象值(values) Out[98]: 0 1 1 2 2 3 3 4 4 5 dtype: int64 In [99]: t = pd.Series([1,2,3,4,5]) In [100]: type(t) Out[100]: pandas.core.series.Series In [101]: t2 = pd.Series([1,2,3,4],index=list('abcd')) # index指定索引 In [102]: t2 Out[102]: a 1 b 2 c 3 d 4 dtype: int64 In [105]: pd.Series(np.arange(2)) Out[105]: 0 0 1 1 dtype: int64 In [107]: t3 = pd.Series({'name':'xyp', 'age':18}) In [108]: t3 Out[108]: name xyp age 18 dtype: object In [109]: t3.dtype Out[109]: dtype('O') # o代表object In [110]: t2.dtype Out[110]: dtype('int64') In [111]: t2.astype(float) Out[111]: a 1.0 b 2.0 c 3.0 d 4.0 dtype: float64
Series切片和索引
In [118]: t3 = pd.Series({'name':'xyp','age':18,'tel':10086}) In [119]: t3 Out[119]: name xyp age 18 tel 10086 dtype: object In [120]: t3[0] Out[120]: 'xyp' In [121]: t3['name'] Out[121]: 'xyp' In [122]: t3[[1,2]] Out[122]: age 18 tel 10086 dtype: object In [123]: t3[[0,2]] Out[123]: name xyp tel 10086 dtype: object In [125]: t3[['age','tel']] Out[125]: age 18 tel 10086 dtype: object In [128]: t3 Out[128]: name xyp age 18 tel 10086 dtype: object In [129]: t3.index Out[129]: Index(['name', 'age', 'tel'], dtype='object') In [130]: for i in t3.index: ...: print(i) ...: name age tel In [131]: type(t3.index) Out[131]: pandas.core.indexes.base.Index In [132]: len(t3.index) Out[132]: 3 In [133]: list(t3.index) Out[133]: ['name', 'age', 'tel'] In [136]: s = pd.Series(range(5)) In [137]: s.where(s>0) Out[137]: 0 NaN 1 1.0 2 2.0 3 3.0 4 4.0 dtype: float64 In [139]: s.where(s>1, 10) # s元素 = s元素 s元素>1 else 10,和numpy用法相反 Out[139]: 0 10 1 10 2 2 3 3 4 4 dtype: int64 In [126]: t Out[126]: 0 1 1 2 2 3 3 4 4 5 dtype: int64 In [127]: t[t>3] Out[127]: 3 4 4 5 dtype: int64
pandas读取外部文件
pd.read_csv pd.read_excel pd.read_sql mongodb数据和mysql不同,需先连接mongodb取出数据然后放入pd.Series(data)中处理 ......
DataFrame创建
In [4]: pd.DataFrame(np.arange(12).reshape(3,4)) # 第一行为行索引,第一列为列索引 Out[4]: 0 1 2 3 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11 In [6]: pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ')) Out[6]: W X Y Z a 0 1 2 3 b 4 5 6 7 c 8 9 10 11 In [7]: t1 = pd.DataFrame(np.arange(12).reshape(3,4),index=list('abc'),columns=list('WXYZ')) In [8]: t1 Out[8]: W X Y Z a 0 1 2 3 b 4 5 6 7 c 8 9 10 11 In [9]: d1 = {'name':['xyp','oynn'],'age':[18,20],'tel':[10086,10087]} In [10]: pd.DataFrame(d1) Out[10]: name age tel 0 xyp 18 10086 1 oynn 20 10087 In [11]: t1 = pd.DataFrame(d1) In [12]: type(t1) Out[12]: pandas.core.frame.DataFrame In [13]: d2 = [{'name':'xyp','age':18,'tel':10086},{'name':'oynn','age':20},{'age':16,'tel':10087}] In [14]: t2 = pd.DataFrame(d2) In [15]: t2 Out[15]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 16 10087.0
DataFrame基础属性
In [15]: t2 Out[15]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 16 10087.0 In [16]: t2.index Out[16]: RangeIndex(start=0, stop=3, step=1) In [17]: t2.columns Out[17]: Index(['name', 'age', 'tel'], dtype='object') In [18]: t2.values Out[18]: array([['xyp', 18, 10086.0], ['oynn', 20, nan], [nan, 16, 10087.0]], dtype=object) In [19]: t2.shape Out[19]: (3, 3) In [20]: t2.dtypes Out[20]: name object age int64 tel float64 dtype: object In [21]: t2.ndim # t2维度 Out[21]: 2 In [22]: t2.head(2) # 显示头部几行,默认5行 Out[22]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN In [23]: t2.tail(2) # 显示末尾几行,默认5行 Out[23]: name age tel 1 oynn 20 NaN 2 NaN 16 10087.0 In [24]: t2.info() # 相关信息概览:行数,列数,列索引,列非空值个数,列类型,内存占用 Out[24]: <class 'pandas.core.frame.DataFrame'> RangeIndex: 3 entries, 0 to 2 # 3行,索引0 to 2 Data columns (total 3 columns): name 2 non-null object age 3 non-null int64 tel 2 non-null float64 dtypes: float64(1), int64(1), object(1) memory usage: 200.0+ bytes # 内存 In [25]: t2.describe() # 快速综合统计结果 Out[25]: age tel count 3.0 2.000000 # 计数 mean 18.0 10086.500000 # 均值 std 2.0 0.707107 # 标准差 min 16.0 10086.000000 # 最小值 25% 17.0 10086.250000 # 四分之一中位数 50% 18.0 10086.500000 # 中位数 75% 19.0 10086.750000 # 四分之三中位数 max 20.0 10087.000000 # 最大值
DataFrame排序方法
In [27]: t2 Out[27]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 16 10087.0 In [30]: t2.sort_values(by='age') Out[30]: name age tel 2 NaN 16 10087.0 0 xyp 18 10086.0 1 oynn 20 NaN In [31]: t2.sort_values(by='age',ascending=False) Out[31]: name age tel 1 oynn 20 NaN 0 xyp 18 10086.0 2 NaN 16 10087.0 In [32]: df = t2.sort_values(by='age',ascending=False) In [33]: df.head(2) Out[33]: name age tel 1 oynn 20 NaN 0 xyp 18 10086.0
DataFrame的索引
In [33]: df Out[33]: name age tel 1 oynn 20 NaN 0 xyp 18 10086.0 2 NaN 16 10087.0 In [34]: df[:2] Out[34]: name age tel 1 oynn 20 NaN 0 xyp 18 10086.0 In [35]: df[:2]['age'] Out[35]: 1 20 0 18 Name: age, dtype: int64 In [36]: df['name'] Out[36]: 1 oynn 0 xyp 2 NaN Name: name, dtype: object In [37]: type(df['name']) Out[37]: pandas.core.series.Series
loc通过标签索引获取数据
In [48]: t2 Out[48]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 16 10087.0 In [49]: t2.loc[0,'name'] # loc[a,b] a为行索引,b为列索引 Out[49]: 'xyp' In [50]: t2.loc[1,'age'] Out[50]: 20 In [51]: type(t2.loc[1,'age']) Out[51]: numpy.int64 In [55]: t2.loc[2] Out[55]: name NaN age 16 tel 10087 Name: 2, dtype: object In [56]: t2.loc[2,:] Out[56]: name NaN age 16 tel 10087 Name: 2, dtype: object In [57]: t2.loc[:,'age'] Out[57]: 0 18 1 20 2 16 Name: age, dtype: int64 In [58]: t2.loc[[0,2],:] Out[58]: name age tel 0 xyp 18 10086.0 2 NaN 16 10087.0 In [59]: t2.loc[:,['name','age']] Out[59]: name age 0 xyp 18 1 oynn 20 2 NaN 16 In [62]: t2.loc[[1,2],['name','tel']] Out[62]: name tel 1 oynn NaN 2 NaN 10087.0 In [63]: type(t2.loc[[1,2],['name','tel']]) Out[63]: pandas.core.frame.DataFrame In [64]: t2.loc[0:2,['name','tel']] # loc取索引,能取到索引最后一个数据 Out[64]: name tel 0 xyp 10086.0 1 oynn NaN 2 NaN 10087.0 In [64]: t2.loc[0:2,'name':'tel'] # loc取索引,能取到索引最后一个数据 Out[64]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 16 10087.0
iloc通过位置索引获取数据
In [66]: t2.iloc[0:2,[0,1]] Out[66]: name age 0 xyp 18 1 oynn 20 In [67]: t2.iloc[0:2,0:2] # iloc取索引不包括最后一个数据 Out[67]: name age 0 xyp 18 1 oynn 20 In [73]: t2.iloc[:,1] = 18 In [74]: t2 Out[74]: name age tel 0 xyp 18 10086.0 1 oynn 18 NaN 2 NaN 18 10087.0
布尔索引
In [79]: t2 Out[79]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 22 10087.0 In [81]: t2[t2['age']>18] Out[81]: name age tel 1 oynn 20 NaN 2 NaN 22 10087.0 In [84]: t2[(t2['age']>18) & (t2['age']<22)] Out[84]: name age tel 1 oynn 20 NaN In [85]: t2[(t2['age']>18) | (t2['age']<22)] Out[85]: name age tel 0 xyp 18 10086.0 1 oynn 20 NaN 2 NaN 22 10087.0
字符串方法
In [95]: t2 Out[95]: name age tel 0 xyp 18 10086,10087 1 oynn 20 NaN 2 NaN 22 10086,10087 In [96]: t2['tel'].str.split(',') Out[96]: 0 [10086, 10087] 1 NaN 2 [10086, 10087] Name: tel, dtype: object ....
pandas缺失数据处理
In [99]: t2 Out[99]: name age tel 0 xyp 18 10086,10087 1 oynn 20 NaN 2 NaN 22 10086,10087 In [101]: pd.isnull(t2) # 判断数据是否有NaN Out[101]: name age tel 0 False False False 1 False False True 2 True False False In [102]: pd.notnull(t2) # 判断数据是否有NaN Out[102]: name age tel 0 True True True 1 True True False 2 False True True 1. 删除NaN所在的行列 In [104]: t2.dropna(axis=0) # 默认为dropna(axis=0, how='any', inplace=False), how='all'时为数据均为NaN才会删除,inplace=True时直接修改t2数组,不需要重新赋值就能改变t2数组 Out[104]: name age tel 0 xyp 18 10086,10087 In [106]: t2.dropna(axis=0, how='all') Out[106]: name age tel 0 xyp 18 10086,10087 1 oynn 20 NaN 2 NaN 22 10086,10087 2. 填充数据 In [107]: t2 Out[107]: name age tel 0 xyp 18 10086,10087 1 oynn 20 NaN 2 NaN 22 10086,10087 In [111]: t2.fillna(t2.mean()) # t2.mean() nan不参与计算均值。name,tel不是数值所以填充失败。 Out[111]: name age tel 0 xyp 18 10086,10087 1 oynn 20 NaN 2 NaN 22 10086,10087 In [112]: t2.fillna(0) Out[112]: name age tel 0 xyp 18 10086,10087 1 oynn 20 0 2 0 22 10086,10087
import pandas as pd file_path = 'IMDB-Movie-Data.csv' df = pd.read_csv(file_path) # 获取电影平均评分 print(df['Rating'].mean()) # 6.7232 # 导演人数 print(len(set(df['Director'].tolist()))) # 644 print(len(df['Director'].unique())) # 644 # 获取演员人数 temp_actors_list = df['Actors'].str.split(',').tolist() actors_list = [j for i in temp_actors_list for j in i] # 双重循环推导式 actors_num = len(set(actors_list)) print(actors_num) # 2394
数据合并
# 数组合并join,行合并 In [1]: import pandas as pd In [2]: import numpy as np In [3]: df1 = pd.DataFrame(np.ones((2,4)),index=['A','B'],columns=list('abcd')) In [4]: df1 Out[4]: a b c d A 1.0 1.0 1.0 1.0 B 1.0 1.0 1.0 1.0 In [5]: df2 = pd.DataFrame(np.zeros((3,3)),index=['A','B','C'],columns=list('xyz')) In [6]: df2 Out[6]: x y z A 0.0 0.0 0.0 B 0.0 0.0 0.0 C 0.0 0.0 0.0 In [7]: df1.join(df2) Out[7]: a b c d x y z A 1.0 1.0 1.0 1.0 0.0 0.0 0.0 B 1.0 1.0 1.0 1.0 0.0 0.0 0.0 In [8]: df2.join(df1) Out[8]: x y z a b c d A 0.0 0.0 0.0 1.0 1.0 1.0 1.0 B 0.0 0.0 0.0 1.0 1.0 1.0 1.0 C 0.0 0.0 0.0 NaN NaN NaN NaN # 数组合并merge,列合并 In [12]: df1 Out[12]: a b c d A 1.0 1.0 1.0 1.0 B 1.0 1.0 1.0 1.0 In [14]: df3 = pd.DataFrame(np.arange(9).reshape((3,3)),columns=list('fax')) In [15]: df3 Out[15]: f a x 0 0 1 2 1 3 4 5 2 6 7 8 In [16]: df1.merge(df3,on='a') # on交集合并,按照a列合并,df1中a列坐标(A,a)和(B,a),和df3中a列坐标(0,a)相等,df1中共2行,合并成2行 Out[16]: a b c d f x 0 1.0 1.0 1.0 1.0 0 2 1 1.0 1.0 1.0 1.0 0 2 In [17]: df1.loc['A','a'] = 100 In [18]: df1 Out[18]: a b c d A 100.0 1.0 1.0 1.0 B 1.0 1.0 1.0 1.0 In [19]: df1.merge(df3,on='a') # on交集合并,按照a列合并,df1中a列坐标(B,a),和df3中a列坐标(0,a)相等,df1中共1行,合并成1行 Out[19]: a b c d f x 0 1.0 1.0 1.0 1.0 0 2 In [20]: df1.merge(df3,on='a',how='inner') # inner内连接,交集 Out[20]: a b c d f x 0 1.0 1.0 1.0 1.0 0 2 In [21]: df1.merge(df3,on='a',how='outer') # outer外连接,并集 Out[21]: a b c d f x 0 100.0 1.0 1.0 1.0 NaN NaN 1 1.0 1.0 1.0 1.0 0.0 2.0 2 4.0 NaN NaN NaN 3.0 5.0 3 7.0 NaN NaN NaN 6.0 8.0
分组聚合
df.groupby(by='') # by值可以为字符串,单个字段分组;by值为[df[''], df[''], ...]多个字段分组 ......
索引和复合索引
In [5]: df1 Out[5]: a b c d A 1.0 1.0 1.0 1.0 B 1.0 1.0 1.0 1.0 In [6]: df1.index # 获取索引 Out[6]: Index(['A', 'B'], dtype='object') In [8]: df1.index = ['a', 'b'] # 指定索引 In [9]: df1 Out[9]: a b c d a 1.0 1.0 1.0 1.0 b 1.0 1.0 1.0 1.0 In [10]: df1.index Out[10]: Index(['a', 'b'], dtype='object') In [11]: df1.reindex(['a','f']) # 重新设置索引,用的比较少 Out[11]: a b c d a 1.0 1.0 1.0 1.0 f NaN NaN NaN NaN In [12]: df1 Out[12]: a b c d a 1.0 1.0 1.0 1.0 b 1.0 1.0 1.0 1.0 In [13]: df1.set_index('a') # 指定某一列作为索引 Out[13]: b c d a 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 In [14]: df1.set_index('a',drop=False) # drop=False不删除指定索引列 Out[14]: a b c d a 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 In [31]: df1 Out[31]: a b c d A 100.0 1.0 1.0 1.0 B 1.0 1.0 1.0 1.0 In [32]: df1['d'].unique() # 对某一列来取值,且值不重复 Out[32]: array([1.]) In [33]: df1['a'].unique() Out[33]: array([100., 1.]) In [36]: df1.set_index('b').index.unique() Out[36]: Float64Index([1.0], dtype='float64', name='b') In [37]: len(df1.set_index('b').index.unique()) Out[37]: 1 In [38]: len(df1.set_index('b').index) Out[38]: 2 In [39]: list(df1.set_index('b').index) Out[39]: [1.0, 1.0]
时间序列
# 时间序列 In [40]: pd.date_range(start='20190831',end='20190930',freq='D') Out[40]: DatetimeIndex(['2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07', '2019-09-08', '2019-09-09', '2019-09-10', '2019-09-11', '2019-09-12', '2019-09-13', '2019-09-14', '2019-09-15', '2019-09-16', '2019-09-17', '2019-09-18', '2019-09-19', '2019-09-20', '2019-09-21', '2019-09-22', '2019-09-23', '2019-09-24', '2019-09-25', '2019-09-26', '2019-09-27', '2019-09-28', '2019-09-29', '2019-09-30'], dtype='datetime64[ns]', freq='D') In [41]: pd.date_range(start='20190831',end='20190930',freq='10D') Out[41]: DatetimeIndex(['2019-08-31', '2019-09-10', '2019-09-20', '2019-09-30'], dtype='datetime64[ns]', freq='10D') In [42]: pd.date_range(start='20190831',periods=10,freq='D') Out[42]: DatetimeIndex(['2019-08-31', '2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07', '2019-09-08', '2019-09-09'], dtype='datetime64[ns]', freq='D') In [43]: pd.date_range(start='20190831',periods=10,freq='M') Out[43]: DatetimeIndex(['2019-08-31', '2019-09-30', '2019-10-31', '2019-11-30', '2019-12-31', '2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30', '2020-05-31'], dtype='datetime64[ns]', freq='M') pd.to_datetime(df['timeStamp'],format='') # resample重采样,是对原样本重新处理的一个方法,是一个对常规时间序列数据重新采样和频率转换的便捷的方法。 降采样:高频数据到低频数据 升采样:低频数据到高频数据 df.resample('D') # PeriodIndex重组时间序列,主要将数据中的分离的时间字段,重组为时间序列,并指定为index period = pd.PeriodIndex(year=df['year'],month=df['month'],day=df['day'],hour=df['hour'],freq='H')