python-pandas的一些小练习——东北大学大数据班数据挖掘实训Python基础二(2)

目录

因为本博客是直接导入markdown文件(jupter写的)生成的,所以会导致分不清代码和输出,如果大家需要清楚的代码和输出,就请下载以下html文件
pandasTrain.html

实训练习:练习使用pandas 包中的dataframe和方法。

1将数据读取为dataframe类型,命名为df

import pandas as pd
import numpy as np
df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\data\\flight.csv")
df.head()
date hour minute dep arr dep_delay arr_delay carrier flight dest plane cancelled time dist
0 2011/12/1 14.0 31.0 1431.0 1730.0 -4.0 -9.0 XE 4319 SAV N11121 0 104.0 851
1 2011/12/1 21.0 15.0 2115.0 2229.0 -10.0 -7.0 XE 4342 LRD N11107 0 50.0 301
2 2011/12/1 10.0 47.0 1047.0 1429.0 20.0 17.0 XE 4434 RIC N11113 0 146.0 1157
3 2011/12/1 19.0 24.0 1924.0 2234.0 -1.0 1.0 XE 4525 SAV N11106 0 105.0 851
4 2011/12/1 7.0 18.0 718.0 1105.0 -2.0 -8.0 XE 4602 DTW N11121 0 139.0 1075

2查看数据的形状[了解数据有多少行、多少列]

df.shape
(174, 14)

3查看数据的最开始的7条数据和尾部的3条数据

df.head(7)
date hour minute dep arr dep_delay arr_delay carrier flight dest plane cancelled time dist
0 2011/12/1 14.0 31.0 1431.0 1730.0 -4.0 -9.0 XE 4319 SAV N11121 0 104.0 851
1 2011/12/1 21.0 15.0 2115.0 2229.0 -10.0 -7.0 XE 4342 LRD N11107 0 50.0 301
2 2011/12/1 10.0 47.0 1047.0 1429.0 20.0 17.0 XE 4434 RIC N11113 0 146.0 1157
3 2011/12/1 19.0 24.0 1924.0 2234.0 -1.0 1.0 XE 4525 SAV N11106 0 105.0 851
4 2011/12/1 7.0 18.0 718.0 1105.0 -2.0 -8.0 XE 4602 DTW N11121 0 139.0 1075
5 2011/12/1 20.0 18.0 2018.0 2200.0 -2.0 -9.0 XE 4674 BNA N11121 0 82.0 657
6 2011/12/2 21.0 20.0 2120.0 2230.0 -5.0 -6.0 XE 4342 LRD N11113 0 53.0 301
df.tail(3)
date hour minute dep arr dep_delay arr_delay carrier flight dest plane cancelled time dist
171 2011/12/30 11.0 14.0 1114.0 1141.0 9.0 6.0 XE 4323 HOB N11107 0 73.0 501
172 2011/12/30 18.0 31.0 1831.0 2145.0 16.0 9.0 XE 4634 CLT N11121 0 116.0 912
173 2011/12/31 14.0 8.0 1408.0 1705.0 2.0 -1.0 XE 4433 TYS N11107 0 98.0 771

4查看数据的各字段的类型

df.dtypes
date          object
hour         float64
minute       float64
dep          float64
arr          float64
dep_delay    float64
arr_delay    float64
carrier       object
flight         int64
dest          object
plane         object
cancelled      int64
time         float64
dist           int64
dtype: object

5查看数据是否有缺失值

df.isna().any()  #列级别的判断,只要该列有为空或者NA的元素,就为True,否则False  isna()和isnull()一样
date         False
hour          True
minute        True
dep           True
arr           True
dep_delay     True
arr_delay     True
carrier      False
flight       False
dest         False
plane        False
cancelled    False
time          True
dist         False
dtype: bool

6如果有缺失值,进行补0操作

df=df.fillna(0)
df.isna().any() 
date         False
hour         False
minute       False
dep          False
arr          False
dep_delay    False
arr_delay    False
carrier      False
flight       False
dest         False
plane        False
cancelled    False
time         False
dist         False
dtype: bool

7从df中按字段取出‘ date’’dist’和‘ flight’形成新的df1

df1 = df.loc[:,['date','dist','flight']]
df1.head()
date dist flight
0 2011/12/1 851 4319
1 2011/12/1 301 4342
2 2011/12/1 1157 4434
3 2011/12/1 851 4525
4 2011/12/1 1075 4602

8从df中按位置选取行为3和4,列为0,1的数据形成新的df2

df2 = df.iloc[[3,4],[0,1]]
df2
date hour
3 2011/12/1 19.0
4 2011/12/1 7.0

9在df中添加一列‘ low_dest’,内容是‘ dest’这列的字符串的小写形式

df['low_dest'] = df['dest'].str.lower()
df.head()
date hour minute dep arr dep_delay arr_delay carrier flight dest plane cancelled time dist low_dest
0 2011/12/1 14.0 31.0 1431.0 1730.0 -4.0 -9.0 XE 4319 SAV N11121 0 104.0 851 sav
1 2011/12/1 21.0 15.0 2115.0 2229.0 -10.0 -7.0 XE 4342 LRD N11107 0 50.0 301 lrd
2 2011/12/1 10.0 47.0 1047.0 1429.0 20.0 17.0 XE 4434 RIC N11113 0 146.0 1157 ric
3 2011/12/1 19.0 24.0 1924.0 2234.0 -1.0 1.0 XE 4525 SAV N11106 0 105.0 851 sav
4 2011/12/1 7.0 18.0 718.0 1105.0 -2.0 -8.0 XE 4602 DTW N11121 0 139.0 1075 dtw

10从df中选取‘ dist’大于1000并且小于1200的数据作为一个新的df3

df3 = df[(df.dist>1000) & (df.dist<1200)]
df3.head()
date hour minute dep arr dep_delay arr_delay carrier flight dest plane cancelled time dist low_dest
2 2011/12/1 10.0 47.0 1047.0 1429.0 20.0 17.0 XE 4434 RIC N11113 0 146.0 1157 ric
4 2011/12/1 7.0 18.0 718.0 1105.0 -2.0 -8.0 XE 4602 DTW N11121 0 139.0 1075 dtw
10 2011/12/2 7.0 13.0 713.0 1058.0 -2.0 -6.0 XE 4501 PIT N11107 0 146.0 1117 pit
14 2011/12/3 19.0 1.0 1901.0 2230.0 -1.0 -14.0 XE 4509 RIC N11113 0 134.0 1157 ric
16 2011/12/4 16.0 8.0 1608.0 2004.0 18.0 40.0 XE 4477 RDU N11121 0 123.0 1042 rdu

11从df中选取‘ time’大于100的数据做为新的df4

df4 = df[df.time>100]
df4.head()
date hour minute dep arr dep_delay arr_delay carrier flight dest plane cancelled time dist low_dest
0 2011/12/1 14.0 31.0 1431.0 1730.0 -4.0 -9.0 XE 4319 SAV N11121 0 104.0 851 sav
2 2011/12/1 10.0 47.0 1047.0 1429.0 20.0 17.0 XE 4434 RIC N11113 0 146.0 1157 ric
3 2011/12/1 19.0 24.0 1924.0 2234.0 -1.0 1.0 XE 4525 SAV N11106 0 105.0 851 sav
4 2011/12/1 7.0 18.0 718.0 1105.0 -2.0 -8.0 XE 4602 DTW N11121 0 139.0 1075 dtw
7 2011/12/2 10.0 25.0 1025.0 1352.0 -5.0 -2.0 XE 4366 CLT N11121 0 120.0 912 clt

12将df3和df4合并为一个dataframe,命名为df5

df5 = pd.concat([df3,df4],axis = 1)  #横向合并 axis决定合并方向 ,axis = 1横向合并 axis = 0竖向合并
df5.head()
date hour minute dep arr dep_delay arr_delay carrier flight dest ... dep_delay arr_delay carrier flight dest plane cancelled time dist low_dest
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -4.0 -9.0 XE 4319.0 SAV N11121 0.0 104.0 851.0 sav
2 2011/12/1 10.0 47.0 1047.0 1429.0 20.0 17.0 XE 4434.0 RIC ... 20.0 17.0 XE 4434.0 RIC N11113 0.0 146.0 1157.0 ric
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -1.0 1.0 XE 4525.0 SAV N11106 0.0 105.0 851.0 sav
4 2011/12/1 7.0 18.0 718.0 1105.0 -2.0 -8.0 XE 4602.0 DTW ... -2.0 -8.0 XE 4602.0 DTW N11121 0.0 139.0 1075.0 dtw
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -5.0 -2.0 XE 4366.0 CLT N11121 0.0 120.0 912.0 clt

5 rows × 30 columns

df3.shape
(43, 15)
df4.shape
(85, 15)
df5.shape  #15+15=30
(86, 30)
pd.concat([df3,df4],axis = 0).shape  #竖向合并  43+85=128
(128, 15)

13取出df3中的‘ time’,‘dist’ 这两列做为df6, 求出df6中每列中最大值与最小值的差值(运用lambda函数的方式进行求解 )

df6 = df3.loc[:,['time','dist']]
df6.head()
df6.apply(lambda x : x.max() - x.min())

14将df根据“ dest”分组,统计各目的地,都有多少条数据,命名为“ dest_count”,然后将“ dest”和“ dest_count”组成新的dataframe命名为df7

df7 = df.groupby('dest',as_index = False)['dest'].agg({'dest_count':np.size})
df7.head()

15根据dep_delay不为0,统计df中有多少次航班起飞延误

df[df.dep_delay!=0].shape[0]

16利用plane字段,统计df中不同航班的飞行次数,以plane和plane_cout格式保存为df8

df8 = df.groupby('plane',as_index = False)['plane'].agg({'dest_count':np.size})
df8.head()

17利用carrier字段,计算df中carrier字段不同属性数据的数据量的比值(如:XE的数据量与MQ的数据量的比值)

df9=df.groupby(['carrier'],as_index=False)['carrier'].size().reset_index(name='carrier_count')
df9.head()
print("XE的数据量与MQ的数据量的比值: ", df9.iat[1,1]/df9.iat[0,1])

18分别将df7和 df8保存到本地文件,文件名分别为df7.csv和df8.csv

df7.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df7.csv")
df8.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df78.csv")
posted @ 2019-12-03 18:18  爱做梦的子浩  阅读(277)  评论(0编辑  收藏  举报