Because this blog post was generated by importing a Markdown file exported from Jupyter, the code and its output tend to run together. If you would like a clean view of the code and output, please download the following HTML file:
pandasTrain.html
Hands-on exercise: practice using the DataFrame and its methods from the pandas package.

1. Read the data into a DataFrame named df

```python
import pandas as pd
import numpy as np

# read the flight data (Windows path, backslashes escaped)
df = pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\data\\flight.csv")
df.head()
```
|   | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist |
|---|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|
| 0 | 2011/12/1 | 14.0 | 31.0 | 1431.0 | 1730.0 | -4.0 | -9.0 | XE | 4319 | SAV | N11121 | 0 | 104.0 | 851 |
| 1 | 2011/12/1 | 21.0 | 15.0 | 2115.0 | 2229.0 | -10.0 | -7.0 | XE | 4342 | LRD | N11107 | 0 | 50.0 | 301 |
| 2 | 2011/12/1 | 10.0 | 47.0 | 1047.0 | 1429.0 | 20.0 | 17.0 | XE | 4434 | RIC | N11113 | 0 | 146.0 | 1157 |
| 3 | 2011/12/1 | 19.0 | 24.0 | 1924.0 | 2234.0 | -1.0 | 1.0 | XE | 4525 | SAV | N11106 | 0 | 105.0 | 851 |
| 4 | 2011/12/1 | 7.0 | 18.0 | 718.0 | 1105.0 | -2.0 | -8.0 | XE | 4602 | DTW | N11121 | 0 | 139.0 | 1075 |
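A quick note on the Windows path passed to `read_csv`: every backslash inside a normal Python string literal has to be doubled, which is easy to get wrong. A raw string or forward slashes avoid the escaping entirely; a minimal sketch using the same (machine-specific) path:

```python
# raw string: backslashes are taken literally
df = pd.read_csv(r"C:\Users\zzh\Desktop\dataMiningExperment\exp2\data\flight.csv")
# forward slashes also work on Windows
df = pd.read_csv("C:/Users/zzh/Desktop/dataMiningExperment/exp2/data/flight.csv")
```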
2. Check the shape of the data (how many rows and columns it has)

```python
df.shape
```

```
(174, 14)
```
3. View the first 7 rows and the last 3 rows of the data

```python
df.head(7)
```
|   | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist |
|---|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|
| 0 | 2011/12/1 | 14.0 | 31.0 | 1431.0 | 1730.0 | -4.0 | -9.0 | XE | 4319 | SAV | N11121 | 0 | 104.0 | 851 |
| 1 | 2011/12/1 | 21.0 | 15.0 | 2115.0 | 2229.0 | -10.0 | -7.0 | XE | 4342 | LRD | N11107 | 0 | 50.0 | 301 |
| 2 | 2011/12/1 | 10.0 | 47.0 | 1047.0 | 1429.0 | 20.0 | 17.0 | XE | 4434 | RIC | N11113 | 0 | 146.0 | 1157 |
| 3 | 2011/12/1 | 19.0 | 24.0 | 1924.0 | 2234.0 | -1.0 | 1.0 | XE | 4525 | SAV | N11106 | 0 | 105.0 | 851 |
| 4 | 2011/12/1 | 7.0 | 18.0 | 718.0 | 1105.0 | -2.0 | -8.0 | XE | 4602 | DTW | N11121 | 0 | 139.0 | 1075 |
| 5 | 2011/12/1 | 20.0 | 18.0 | 2018.0 | 2200.0 | -2.0 | -9.0 | XE | 4674 | BNA | N11121 | 0 | 82.0 | 657 |
| 6 | 2011/12/2 | 21.0 | 20.0 | 2120.0 | 2230.0 | -5.0 | -6.0 | XE | 4342 | LRD | N11113 | 0 | 53.0 | 301 |
```python
df.tail(3)
```
|     | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist |
|-----|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|
| 171 | 2011/12/30 | 11.0 | 14.0 | 1114.0 | 1141.0 | 9.0 | 6.0 | XE | 4323 | HOB | N11107 | 0 | 73.0 | 501 |
| 172 | 2011/12/30 | 18.0 | 31.0 | 1831.0 | 2145.0 | 16.0 | 9.0 | XE | 4634 | CLT | N11121 | 0 | 116.0 | 912 |
| 173 | 2011/12/31 | 14.0 | 8.0 | 1408.0 | 1705.0 | 2.0 | -1.0 | XE | 4433 | TYS | N11107 | 0 | 98.0 | 771 |
4. Check the data type of each column

```python
df.dtypes
```

```
date          object
hour         float64
minute       float64
dep          float64
arr          float64
dep_delay    float64
arr_delay    float64
carrier       object
flight         int64
dest          object
plane         object
cancelled      int64
time         float64
dist           int64
dtype: object
```
5. Check whether the data contains missing values

```python
df.isna().any()
```

```
date         False
hour          True
minute        True
dep           True
arr           True
dep_delay     True
arr_delay     True
carrier      False
flight       False
dest         False
plane        False
cancelled    False
time          True
dist         False
dtype: bool
```
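`isna().any()` only reports which columns contain at least one missing value. If you also want to know how many values are missing in each column, a small follow-up on the same df:

```python
# number of missing values per column (0 for columns without NaN)
df.isna().sum()
```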
6. If there are missing values, fill them with 0

```python
df = df.fillna(0)
df.isna().any()
```

```
date         False
hour         False
minute       False
dep          False
arr          False
dep_delay    False
arr_delay    False
carrier      False
flight       False
dest         False
plane        False
cancelled    False
time         False
dist         False
dtype: bool
```
7. Select the 'date', 'dist' and 'flight' columns from df to form a new DataFrame df1

```python
df1 = df.loc[:, ['date', 'dist', 'flight']]
df1.head()
```
|   | date | dist | flight |
|---|------|------|--------|
| 0 | 2011/12/1 | 851 | 4319 |
| 1 | 2011/12/1 | 301 | 4342 |
| 2 | 2011/12/1 | 1157 | 4434 |
| 3 | 2011/12/1 | 851 | 4525 |
| 4 | 2011/12/1 | 1075 | 4602 |
8. Select rows 3 and 4 and columns 0 and 1 from df by position to form a new DataFrame df2

```python
df2 = df.iloc[[3, 4], [0, 1]]
df2
```
|   | date | hour |
|---|------|------|
| 3 | 2011/12/1 | 19.0 |
| 4 | 2011/12/1 | 7.0 |
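`iloc` selects strictly by integer position, whereas `loc` selects by label. Because df still carries its default RangeIndex, the same two rows can also be fetched by label; a sketch for comparison:

```python
# label-based selection; it matches the iloc result here only because the
# row labels (0, 1, 2, ...) happen to coincide with the positions
df.loc[[3, 4], ['date', 'hour']]
```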
9. Add a column 'low_dest' to df containing the lowercase form of the strings in the 'dest' column

```python
df['low_dest'] = df['dest'].str.lower()
df.head()
```
|   | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist | low_dest |
|---|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|----------|
| 0 | 2011/12/1 | 14.0 | 31.0 | 1431.0 | 1730.0 | -4.0 | -9.0 | XE | 4319 | SAV | N11121 | 0 | 104.0 | 851 | sav |
| 1 | 2011/12/1 | 21.0 | 15.0 | 2115.0 | 2229.0 | -10.0 | -7.0 | XE | 4342 | LRD | N11107 | 0 | 50.0 | 301 | lrd |
| 2 | 2011/12/1 | 10.0 | 47.0 | 1047.0 | 1429.0 | 20.0 | 17.0 | XE | 4434 | RIC | N11113 | 0 | 146.0 | 1157 | ric |
| 3 | 2011/12/1 | 19.0 | 24.0 | 1924.0 | 2234.0 | -1.0 | 1.0 | XE | 4525 | SAV | N11106 | 0 | 105.0 | 851 | sav |
| 4 | 2011/12/1 | 7.0 | 18.0 | 718.0 | 1105.0 | -2.0 | -8.0 | XE | 4602 | DTW | N11121 | 0 | 139.0 | 1075 | dtw |
10. From df, select the rows where 'dist' is greater than 1000 and less than 1200 as a new DataFrame df3

```python
df3 = df[(df.dist > 1000) & (df.dist < 1200)]
df3.head()
```
|    | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist | low_dest |
|----|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|----------|
| 2  | 2011/12/1 | 10.0 | 47.0 | 1047.0 | 1429.0 | 20.0 | 17.0 | XE | 4434 | RIC | N11113 | 0 | 146.0 | 1157 | ric |
| 4  | 2011/12/1 | 7.0 | 18.0 | 718.0 | 1105.0 | -2.0 | -8.0 | XE | 4602 | DTW | N11121 | 0 | 139.0 | 1075 | dtw |
| 10 | 2011/12/2 | 7.0 | 13.0 | 713.0 | 1058.0 | -2.0 | -6.0 | XE | 4501 | PIT | N11107 | 0 | 146.0 | 1117 | pit |
| 14 | 2011/12/3 | 19.0 | 1.0 | 1901.0 | 2230.0 | -1.0 | -14.0 | XE | 4509 | RIC | N11113 | 0 | 134.0 | 1157 | ric |
| 16 | 2011/12/4 | 16.0 | 8.0 | 1608.0 | 2004.0 | 18.0 | 40.0 | XE | 4477 | RDU | N11121 | 0 | 123.0 | 1042 | rdu |
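The same boolean-mask filter can be written more compactly with `DataFrame.query`, which accepts chained comparisons; a sketch on the same df (the name `df3_alt` is just illustrative):

```python
# equivalent to df[(df.dist > 1000) & (df.dist < 1200)]
df3_alt = df.query("1000 < dist < 1200")
```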
11. From df, select the rows where 'time' is greater than 100 as a new DataFrame df4

```python
df4 = df[df.time > 100]
df4.head()
```
|   | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist | low_dest |
|---|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|----------|
| 0 | 2011/12/1 | 14.0 | 31.0 | 1431.0 | 1730.0 | -4.0 | -9.0 | XE | 4319 | SAV | N11121 | 0 | 104.0 | 851 | sav |
| 2 | 2011/12/1 | 10.0 | 47.0 | 1047.0 | 1429.0 | 20.0 | 17.0 | XE | 4434 | RIC | N11113 | 0 | 146.0 | 1157 | ric |
| 3 | 2011/12/1 | 19.0 | 24.0 | 1924.0 | 2234.0 | -1.0 | 1.0 | XE | 4525 | SAV | N11106 | 0 | 105.0 | 851 | sav |
| 4 | 2011/12/1 | 7.0 | 18.0 | 718.0 | 1105.0 | -2.0 | -8.0 | XE | 4602 | DTW | N11121 | 0 | 139.0 | 1075 | dtw |
| 7 | 2011/12/2 | 10.0 | 25.0 | 1025.0 | 1352.0 | -5.0 | -2.0 | XE | 4366 | CLT | N11121 | 0 | 120.0 | 912 | clt |
12. Merge df3 and df4 into a single DataFrame named df5

```python
df5 = pd.concat([df3, df4], axis=1)
df5.head()
```
|   | date | hour | minute | dep | arr | dep_delay | arr_delay | carrier | flight | dest | ... | dep_delay | arr_delay | carrier | flight | dest | plane | cancelled | time | dist | low_dest |
|---|------|------|--------|-----|-----|-----------|-----------|---------|--------|------|-----|-----------|-----------|---------|--------|------|-------|-----------|------|------|----------|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -4.0 | -9.0 | XE | 4319.0 | SAV | N11121 | 0.0 | 104.0 | 851.0 | sav |
| 2 | 2011/12/1 | 10.0 | 47.0 | 1047.0 | 1429.0 | 20.0 | 17.0 | XE | 4434.0 | RIC | ... | 20.0 | 17.0 | XE | 4434.0 | RIC | N11113 | 0.0 | 146.0 | 1157.0 | ric |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -1.0 | 1.0 | XE | 4525.0 | SAV | N11106 | 0.0 | 105.0 | 851.0 | sav |
| 4 | 2011/12/1 | 7.0 | 18.0 | 718.0 | 1105.0 | -2.0 | -8.0 | XE | 4602.0 | DTW | ... | -2.0 | -8.0 | XE | 4602.0 | DTW | N11121 | 0.0 | 139.0 | 1075.0 | dtw |
| 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -5.0 | -2.0 | XE | 4366.0 | CLT | N11121 | 0.0 | 120.0 | 912.0 | clt |

5 rows × 30 columns
```python
df3.shape
```

```
(43, 15)
```

```python
df4.shape
```

```
(85, 15)
```

```python
df5.shape
```

```
(86, 30)
```

```python
pd.concat([df3, df4], axis=0).shape
```

```
(128, 15)
```
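Why the two results differ: with `axis=1`, `concat` places the frames side by side and outer-joins their row labels, so df5 has the union of the row labels of df3 and df4 (86 labels, since 42 rows satisfy both conditions) and 15 + 15 = 30 columns, with NaN wherever a row belongs to only one of the frames. With `axis=0` the frames are simply stacked, giving 43 + 85 = 128 rows and the original 15 columns. A minimal check of that reading, using the frames built above:

```python
# df5's row labels are the union of df3's and df4's labels
assert len(df3.index.union(df4.index)) == df5.shape[0]   # 86
# stacking with axis=0 keeps every row of both frames
assert len(df3) + len(df4) == 128
```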
13. Take the 'time' and 'dist' columns of df3 as df6, and compute the difference between the maximum and minimum of each column of df6 (solve it with a lambda function)

```python
df6 = df3.loc[:, ['time', 'dist']]
df6.head()
df6.apply(lambda x: x.max() - x.min())
```
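`apply` passes each column to the lambda as a Series, so the expression computes a per-column range. The same result can be obtained with plain column-wise reductions; a sketch:

```python
# per-column maximum minus per-column minimum, equivalent to the lambda version
df6.max() - df6.min()
```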
14. Group df by 'dest', count how many rows each destination has and call the count 'dest_count', then combine 'dest' and 'dest_count' into a new DataFrame named df7

```python
df7 = df.groupby('dest', as_index=False)['dest'].agg({'dest_count': np.size})
df7.head()
```
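Passing a renaming dict to `agg` on a grouped column was deprecated in pandas 0.20 and removed in pandas 1.0, so the line above only runs on older versions. A sketch of an equivalent using named aggregation (available since pandas 0.25):

```python
# one row per destination, with the group size in a column named dest_count
df7 = df.groupby('dest', as_index=False).agg(dest_count=('dest', 'size'))
```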
15. Using dep_delay being non-zero as the criterion, count how many flights in df had a departure delay

```python
df[df.dep_delay != 0].shape[0]
```
16. Using the 'plane' field, count the number of flights flown by each aircraft in df, and save the result as df8 with the columns plane and plane_count

```python
df8 = df.groupby('plane', as_index=False)['plane'].agg({'plane_count': np.size})
df8.head()
```
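The same pandas 1.0 caveat from step 14 applies here; a named-aggregation sketch:

```python
# number of flights per aircraft tail number
df8 = df.groupby('plane', as_index=False).agg(plane_count=('plane', 'size'))
```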
17. Using the 'carrier' field, compute the ratio between the row counts of the different carrier values in df (e.g. the ratio of the number of XE rows to the number of MQ rows)

```python
df9 = df.groupby(['carrier'], as_index=False)['carrier'].size().reset_index(name='carrier_count')
df9.head()
print("Ratio of XE row count to MQ row count: ", df9.iat[1, 1] / df9.iat[0, 1])
```
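Indexing with `iat` relies on the alphabetical ordering of the carriers (MQ sorts before XE here). Looking the counts up by label is more robust; a sketch, assuming both carriers are present in the data:

```python
# row count per carrier value, then a label-based lookup
carrier_counts = df['carrier'].value_counts()
ratio_xe_mq = carrier_counts['XE'] / carrier_counts['MQ']
```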
18. Save df7 and df8 to local files named df7.csv and df8.csv respectively

```python
df7.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df7.csv")
df8.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df8.csv")
```
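By default `to_csv` also writes the row index as an extra unnamed first column; if only the data columns are wanted in the file, `index=False` drops it. A sketch for df7:

```python
# write df7 without the extra index column
df7.to_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp2\\df7.csv", index=False)
```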
Hi everyone, I'm [爱做梦的子浩](https://blog.csdn.net/weixin_43124279), a junior in the Big Data experimental class at Northeastern University and still very much a beginner. I aspire to excellence and admire excellent people, and I have already received two summer offers. Feel free to reach out to me for a chat 😂😂😂
Here is my blog: [子浩的博客](https://blog.csdn.net/weixin_43124279)

Copyright notice: this article is an original work by the CSDN blogger 爱做梦的子浩 and is released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when reposting.