pandas practice exercises (open data from US traffic police stops)

This article covers the same material better: http://wittyfans.com/coding/%E5%88%A9%E7%94%A8Pandas%E5%88%86%E6%9E%90%E7%BE%8E%E5%9B%BD%E4%BA%A4%E8%AD%A6%E5%BC%80%E6%94%BE%E7%9A%84%E6%90%9C%E6%9F%A5%E6%95%B0%E6%8D%AE.html

import pandas as pd
import matplotlib.pyplot as plt


# Required so that plots render inline in the notebook
%matplotlib inline


# Load the downloaded police data; ri stands for the Rhode Island police data
ri=pd.read_csv('police.csv')

ri.head()

Out[2]:

	stop_date	stop_time	county_name	driver_gender	driver_age_raw	driver_age	driver_race	violation_raw	violation	search_conducted	search_type	stop_outcome	is_arrested	stop_duration	drugs_related_stop
0	2005-01-02	01:55	NaN	M	1985.0	20.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
1	2005-01-18	08:15	NaN	M	1965.0	40.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
2	2005-01-23	23:15	NaN	M	1972.0	33.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
3	2005-02-20	17:15	NaN	M	1986.0	19.0	White	Call for Service	Other	False	NaN	Arrest Driver	True	16-30 Min	False
4	2005-03-14	10:00	NaN	F	1984.0	21.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False

In [3]:

ri.shape

Out[3]:

(91741, 15)

In [4]:

ri.isnull().sum()

Out[4]:

stop_date                 0
stop_time                 0
county_name           91741
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64

Removing a column

In [5]:

ri.head()

Out[5]:

	stop_date	stop_time	county_name	driver_gender	driver_age_raw	driver_age	driver_race	violation_raw	violation	search_conducted	search_type	stop_outcome	is_arrested	stop_duration	drugs_related_stop
0	2005-01-02	01:55	NaN	M	1985.0	20.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
1	2005-01-18	08:15	NaN	M	1965.0	40.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
2	2005-01-23	23:15	NaN	M	1972.0	33.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
3	2005-02-20	17:15	NaN	M	1986.0	19.0	White	Call for Service	Other	False	NaN	Arrest Driver	True	16-30 Min	False
4	2005-03-14	10:00	NaN	F	1984.0	21.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False

In [6]:

# Equivalent to ri.drop('county_name', axis=1, inplace=True)
# Drop the county_name column, whose values are all missing
ri.drop('county_name', axis='columns', inplace=True)

In [7]:

ri.shape

Out[7]:

(91741, 14)

In [8]:

ri.columns

Out[8]:

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')

In [9]:

# Drop any column in which every value is missing (none are left, so the shape is unchanged)
ri.dropna(axis='columns',how='all').shape

Out[9]:

(91741, 14)
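
To make the difference between how='all' and how='any' concrete, here is a minimal sketch on a hypothetical three-row frame (the data below is invented for illustration and is not taken from police.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: county_name is entirely missing,
# driver_gender is only partly missing.
df = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-01-18", "2005-01-23"],
    "county_name": [np.nan, np.nan, np.nan],
    "driver_gender": ["M", np.nan, "F"],
})

# how='all' drops a column only when every value in it is missing.
all_missing_dropped = df.dropna(axis="columns", how="all")

# how='any' (the default) drops a column as soon as one value is missing.
any_missing_dropped = df.dropna(axis="columns", how="any")

print(all_missing_dropped.columns.tolist())
print(any_missing_dropped.columns.tolist())
```

With how='all' only county_name disappears, which is why the call above is a safe way to remove empty columns without losing partly filled ones.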

Filtering in pandas

Keep only the rows for which a boolean condition is True; here we keep the rows where violation equals 'Speeding'.

In [10]:

ri[ri.violation=='Speeding'].head()

Out[10]:

	stop_date	stop_time	driver_gender	driver_age_raw	driver_age	driver_race	violation_raw	violation	search_conducted	search_type	stop_outcome	is_arrested	stop_duration	drugs_related_stop
0	2005-01-02	01:55	M	1985.0	20.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
1	2005-01-18	08:15	M	1965.0	40.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
2	2005-01-23	23:15	M	1972.0	33.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
4	2005-03-14	10:00	F	1984.0	21.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False
6	2005-04-01	17:30	M	1969.0	36.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False

value_counts

In [11]:

# How many male and female drivers were stopped for speeding?
print(ri[ri.violation=='Speeding'].driver_gender.value_counts())

M    32979
F    15482
Name: driver_gender, dtype: int64

In [12]:

# Share of men and women among speeding stops; normalize=True returns proportions instead of counts
print(ri[ri.violation=='Speeding'].driver_gender.value_counts(normalize=True))

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64

In [13]:

ri.loc[ri.violation=='Speeding','driver_gender'].value_counts(normalize=True)

Out[13]:

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64
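
The filter-then-count pattern used in the last three cells can be sketched on a small hypothetical frame (invented data, same column names as the notebook). It also shows that normalize=True is just the counts divided by their total:

```python
import pandas as pd

# Hypothetical five-row sample, not taken from police.csv.
stops = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Equipment", "Speeding", "Other"],
    "driver_gender": ["M", "F", "M", "M", "F"],
})

# Boolean mask: True for the rows we want to keep.
is_speeding = stops.violation == "Speeding"

# .loc selects the masked rows and one column in a single step.
counts = stops.loc[is_speeding, "driver_gender"].value_counts()
shares = stops.loc[is_speeding, "driver_gender"].value_counts(normalize=True)

# normalize=True is equivalent to dividing each count by the total.
manual = counts / counts.sum()

print(counts)
print(shares)
```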

In [14]:

# Among male drivers, the proportion of each violation type
ri[ri.driver_gender == 'M'].violation.value_counts(normalize=True)

Out[14]:

Speeding               0.524350
Moving violation       0.207012
Equipment              0.135671
Other                  0.057668
Registration/plates    0.038461
Seat belt              0.036839
Name: violation, dtype: float64

In [15]:

# Among female drivers, the proportion of each violation type
ri[ri.driver_gender=='F'].violation.value_counts(normalize=True)

Out[15]:

Speeding               0.658500
Moving violation       0.136277
Equipment              0.105780
Registration/plates    0.043086
Other                  0.029348
Seat belt              0.027009
Name: violation, dtype: float64

The groupby method

View the proportion of each violation value within each driver_gender.

In [16]:

# One call that reproduces both results above, so they can be compared side by side
ri.groupby('driver_gender').violation.value_counts(normalize=True)

Out[16]:

driver_gender  violation          
F              Speeding               0.658500
               Moving violation       0.136277
               Equipment              0.105780
               Registration/plates    0.043086
               Other                  0.029348
               Seat belt              0.027009
M              Speeding               0.524350
               Moving violation       0.207012
               Equipment              0.135671
               Other                  0.057668
               Registration/plates    0.038461
               Seat belt              0.036839
Name: violation, dtype: float64
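
As a small check of the claim that the grouped call reproduces the per-gender filters, the following sketch uses a hypothetical six-row frame (invented data) and compares the two approaches:

```python
import pandas as pd

# Hypothetical sample, not taken from police.csv.
stops = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Equipment", "Other",
                  "Speeding", "Equipment"],
})

# Proportions of each violation, computed separately within each gender.
grouped = stops.groupby("driver_gender").violation.value_counts(normalize=True)

# The same numbers for one gender, obtained by filtering first.
female_only = stops[stops.driver_gender == "F"].violation.value_counts(normalize=True)

print(grouped)
print(female_only)
```

The result of the grouped call has a two-level index (gender, violation), and the proportions sum to 1 within each gender rather than over the whole frame.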

The mean method

Taking the mean of a boolean column gives the proportion of True values.

In [17]:

# True means a search was conducted, False means it was not
print(ri.search_conducted.value_counts(normalize=True))

False    0.965163
True     0.034837
Name: search_conducted, dtype: float64

In [18]:

# Here mean() gives the proportion of True values directly
print(ri.search_conducted.mean())

0.03483720473942948

Group by gender and compare the search rates

In [19]:

ri.groupby('driver_gender').search_conducted.mean()

Out[19]:

driver_gender
F    0.020033
M    0.043326
Name: search_conducted, dtype: float64

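Men are searched at a higher rate than women.

Now group by more than one column to see the search rate for each gender within each violation type.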

In [20]:

ri.groupby(['violation','driver_gender']).search_conducted.mean()

Out[20]:

violation            driver_gender
Equipment            F                0.042622
                     M                0.070081
Moving violation     F                0.036205
                     M                0.059831
Other                F                0.056522
                     M                0.047146
Registration/plates  F                0.066140
                     M                0.110376
Seat belt            F                0.012598
                     M                0.037980
Speeding             F                0.008720
                     M                0.024925
Name: search_conducted, dtype: float64
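
Why the mean of a boolean column is a rate: True counts as 1 and False as 0, so the mean is the number of True values divided by the number of rows. A minimal sketch on hypothetical data (invented rows, same column names as the notebook):

```python
import pandas as pd

# Hypothetical sample, not taken from police.csv.
stops = pd.DataFrame({
    "violation": ["Speeding", "Speeding", "Equipment", "Equipment",
                  "Speeding", "Equipment"],
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "search_conducted": [True, False, False, True, False, False],
})

# Overall search rate: 2 True values out of 6 rows.
overall_rate = stops.search_conducted.mean()

# The same number from value_counts(normalize=True).
share_true = stops.search_conducted.value_counts(normalize=True)[True]

# Search rate per gender, then per (violation, gender) pair.
rate_by_gender = stops.groupby("driver_gender").search_conducted.mean()
rate_by_both = stops.groupby(["violation", "driver_gender"]).search_conducted.mean()

print(overall_rate)
print(rate_by_gender)
print(rate_by_both)
```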

In [21]:

ri.isnull().sum()

Out[21]:

stop_date                 0
stop_time                 0
driver_gender          5335
driver_age_raw         5327
driver_age             5621
driver_race            5333
violation_raw          5333
violation              5333
search_conducted          0
search_type           88545
stop_outcome           5333
is_arrested            5333
stop_duration          5333
drugs_related_stop        0
dtype: int64

In [22]:

# Is search_type missing whenever search_conducted is False?
ri.search_conducted.value_counts()

Out[22]:

False    88545
True      3196
Name: search_conducted, dtype: int64

The number of False values (88545) matches the number of missing search_type values shown above.

Let's verify this another way.

In [23]:

ri[ri.search_conducted==False].search_type.value_counts()

Out[23]:

Series([], Name: search_type, dtype: int64)

In [24]:

# value_counts() ignores missing values (NaN) by default
ri[ri.search_conducted==False].search_type.value_counts(dropna=False)

Out[24]:

NaN    88545
Name: search_type, dtype: int64
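
The empty result in Out[23] followed by the NaN row in Out[24] can be reproduced on a hypothetical five-element Series (invented values):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three of the five values are missing.
search_type = pd.Series(
    ["Probable Cause", np.nan, np.nan, "Inventory", np.nan],
    name="search_type",
)

# Default behaviour: missing values are left out of the tally.
default_counts = search_type.value_counts()

# dropna=False adds a row that counts the missing values.
with_missing = search_type.value_counts(dropna=False)

# isna().sum() is another way to count the missing values.
n_missing = search_type.isna().sum()

print(default_counts)
print(with_missing)
print(n_missing)
```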

In [25]:

# When search_conducted is True, search_type is never missing
ri[ri.search_conducted==True].search_type.value_counts(dropna=False)

Out[25]:

Incident to Arrest                                          1219
Probable Cause                                               891
Inventory                                                    220
Reasonable Suspicion                                         197
Protective Frisk                                             161
Incident to Arrest,Inventory                                 129
Incident to Arrest,Probable Cause                            106
Probable Cause,Reasonable Suspicion                           75
Incident to Arrest,Inventory,Probable Cause                   34
Incident to Arrest,Protective Frisk                           33
Probable Cause,Protective Frisk                               33
Inventory,Probable Cause                                      22
Incident to Arrest,Reasonable Suspicion                       13
Incident to Arrest,Inventory,Protective Frisk                 11
Protective Frisk,Reasonable Suspicion                         11
Inventory,Protective Frisk                                    11
Incident to Arrest,Probable Cause,Protective Frisk            10
Incident to Arrest,Probable Cause,Reasonable Suspicion         6
Incident to Arrest,Inventory,Reasonable Suspicion              4
Inventory,Reasonable Suspicion                                 4
Inventory,Probable Cause,Protective Frisk                      2
Inventory,Probable Cause,Reasonable Suspicion                  2
Incident to Arrest,Protective Frisk,Reasonable Suspicion       1
Probable Cause,Protective Frisk,Reasonable Suspicion           1
Name: search_type, dtype: int64

In [26]:

ri[ri.search_conducted==True].search_type.isnull().sum()

Out[26]:

0

Viewing the search types

In [27]:

ri.search_type.value_counts(dropna=False)

Out[27]:

NaN                                                         88545
Incident to Arrest                                           1219
Probable Cause                                                891
Inventory                                                     220
Reasonable Suspicion                                          197
Protective Frisk                                              161
Incident to Arrest,Inventory                                  129
Incident to Arrest,Probable Cause                             106
Probable Cause,Reasonable Suspicion                            75
Incident to Arrest,Inventory,Probable Cause                    34
Incident to Arrest,Protective Frisk                            33
Probable Cause,Protective Frisk                                33
Inventory,Probable Cause                                       22
Incident to Arrest,Reasonable Suspicion                        13
Inventory,Protective Frisk                                     11
Incident to Arrest,Inventory,Protective Frisk                  11
Protective Frisk,Reasonable Suspicion                          11
Incident to Arrest,Probable Cause,Protective Frisk             10
Incident to Arrest,Probable Cause,Reasonable Suspicion          6
Incident to Arrest,Inventory,Reasonable Suspicion               4
Inventory,Reasonable Suspicion                                  4
Inventory,Probable Cause,Reasonable Suspicion                   2
Inventory,Probable Cause,Protective Frisk                       2
Incident to Arrest,Protective Frisk,Reasonable Suspicion        1
Probable Cause,Protective Frisk,Reasonable Suspicion            1
Name: search_type, dtype: int64

In [28]:

ri['frisk']=ri.search_type=='Protective Frisk'

In [29]:

ri.frisk.dtype

Out[29]:

dtype('bool')

In [30]:

ri.frisk.sum()

Out[30]:

161

In [31]:

ri.frisk.mean()

Out[31]:

0.0017549405391264537

In [32]:

ri.frisk.value_counts()

Out[32]:

False    91580
True       161
Name: frisk, dtype: int64

In [33]:

161/(91580+161)

Out[33]:

0.0017549405391264537

字符操作

In [35]:

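# The step above assigned the result of ri.search_type=='Protective Frisk' to the ri['frisk'] column
# Now use a string containment test instead, so that combined search types also match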
ri['frisk']=ri.search_type.str.contains('Protective Frisk')

In [36]:

ri.frisk.sum()

Out[36]:

274

In [37]:

ri.frisk.mean()

Out[37]:

0.08573216520650813

In [38]:

# Count the matching and non-matching rows, to compare with the proportion from mean()
ri.frisk.value_counts()

Out[38]:

False    2922
True      274
Name: frisk, dtype: int64

In [41]:

# Check that the manual calculation gives the same result as mean()
274/(2922+274)

Out[41]:

0.08573216520650813

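The section above computes proportions with string matching.

Use the correct denominator when computing a rate: the second result divides by the 3196 stops that had a search, not by all 91741 stops.

pandas calculations ignore missing values.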
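
The jump from about 0.0018 to about 0.086 comes from two changes at once: partial matching finds more rows, and the missing values drop out of the denominator. The sketch below separates the two effects on a hypothetical six-element Series (invented values). How str.contains treats missing values can depend on the pandas version and string dtype, so this sketch makes the denominator explicit with notna() and na=False instead of relying on NaN propagation:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: three searches and three stops without a search.
search_type = pd.Series([
    "Protective Frisk",
    "Probable Cause,Protective Frisk",
    "Inventory",
    np.nan,
    np.nan,
    np.nan,
])

# Exact comparison: only the first row matches; missing values compare as False.
exact = search_type == "Protective Frisk"

# Partial match over all rows, with missing values explicitly treated as False.
contains_all = search_type.str.contains("Protective Frisk", na=False)

# Partial match restricted to the rows where a search type was recorded.
searched = search_type.notna()
contains_searched = search_type[searched].str.contains("Protective Frisk")

rate_among_all = contains_all.mean()            # denominator: all 6 stops
rate_among_searches = contains_searched.mean()  # denominator: the 3 searches

print(exact.sum(), contains_all.sum())
print(rate_among_all, rate_among_searches)
```

Both rates are valid; which one to report depends on whether the question is about all stops or only about stops that involved a search.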

In [42]:

# Which year has the fewest stops?
ri.stop_date.str.slice(0,4).value_counts()

Out[42]:

2012    10970
2006    10639
2007     9476
2014     9228
2008     8752
2015     8599
2011     8126
2013     7924
2009     7908
2010     7561
2005     2558
Name: stop_date, dtype: int64

In [43]:

# Convert ri.stop_date to a datetime Series and store it in a new stop_datetime column
ri['stop_datetime'] = pd.to_datetime(ri.stop_date)

# Note the .dt accessor, which is analogous to the .str accessor used above
# After .dt you can use attributes such as year and month
ri.stop_datetime.dt.year.value_counts()

Out[43]:

2012    10970
2006    10639
2007     9476
2014     9228
2008     8752
2015     8599
2011     8126
2013     7924
2009     7908
2010     7561
2005     2558
Name: stop_datetime, dtype: int64

In [44]:

ri.stop_datetime.dt.month.value_counts()

Out[44]:

1     8479
5     7935
11    7877
10    7745
3     7742
6     7630
8     7615
7     7568
4     7529
9     7427
12    7152
2     7042
Name: stop_datetime, dtype: int64
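
The two ways of extracting the year give the same counts but different index types: string slicing yields string labels, while the .dt accessor yields integers. A minimal sketch on three hypothetical dates:

```python
import pandas as pd

# Hypothetical sample dates in the same YYYY-MM-DD format as stop_date.
stop_date = pd.Series(["2005-01-02", "2005-03-14", "2006-03-01"])

# String approach: the year is the first four characters (labels are strings).
years_from_str = stop_date.str.slice(0, 4).value_counts()

# Datetime approach: parse once, then read components through .dt.
stop_datetime = pd.to_datetime(stop_date)
years_from_dt = stop_datetime.dt.year.value_counts()
months = stop_datetime.dt.month.value_counts()

print(years_from_str)
print(years_from_dt)
print(months)
```

The datetime version is usually the better choice once more than one component (year, month, hour) is needed, since the parsing is done only once.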

In [46]:

# Drug-related stops
ri.drugs_related_stop.dtype

Out[46]:

dtype('bool')

In [48]:

# Baseline rate
ri.drugs_related_stop.mean()

Out[48]:

0.008883705213590434

In [55]:

# You cannot group by hour until you have derived the hour
# Convert the stop_time column to datetime, then group by its hour component
ri['stop_time_datetime']=pd.to_datetime(ri.stop_time)
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean()

Out[55]:

stop_time_datetime
0     0.019728
1     0.013507
2     0.015462
3     0.017065
4     0.011811
5     0.004762
6     0.003040
7     0.003281
8     0.002687
9     0.006288
10    0.005714
11    0.006976
12    0.004467
13    0.010326
14    0.007810
15    0.006416
16    0.005723
17    0.005517
18    0.010148
19    0.011596
20    0.008084
21    0.013342
22    0.013533
23    0.016344
Name: drugs_related_stop, dtype: float64
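
Grouping by a derived Series, rather than by a column name, is what makes the hourly breakdown possible without adding a permanent column. The sketch below uses hypothetical times and passes an explicit format="%H:%M" (an assumption added here; the notebook lets pandas infer the format):

```python
import pandas as pd

# Hypothetical sample, not taken from police.csv.
stops = pd.DataFrame({
    "stop_time": ["01:55", "01:10", "08:15", "23:15", "23:40", "23:59"],
    "drugs_related_stop": [True, False, False, True, False, False],
})

# Parse the HH:MM strings and keep only the hour (0 to 23).
stop_hour = pd.to_datetime(stops.stop_time, format="%H:%M").dt.hour

# Rate of drug-related stops within each hour.
rate_by_hour = stops.groupby(stop_hour).drugs_related_stop.mean()

# Number of stops per hour, ordered by hour rather than by count.
stops_per_hour = stop_hour.value_counts().sort_index()

print(rate_by_hour)
print(stops_per_hour)
```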

In [58]:

# Plot the drug-related stop rate by hour
ri.groupby(ri.stop_time_datetime.dt.hour).drugs_related_stop.mean().plot()

Out[58]:

<matplotlib.axes._subplots.AxesSubplot at 0x9d72d30>

In [63]:

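# Number of stops per hour (all stops, not only drug-related); unsorted, so the line follows count order rather than time order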
ri.stop_time_datetime.dt.hour.value_counts().plot()

Out[63]:

<matplotlib.axes._subplots.AxesSubplot at 0x5460710>

In [65]:

# Number of stops per hour, sorted by hour before plotting
ri.stop_time_datetime.dt.hour.value_counts().sort_index().plot()

Out[65]:

<matplotlib.axes._subplots.AxesSubplot at 0x5420860>

In [66]:

ri.groupby(ri.stop_time_datetime.dt.hour).stop_date.count().plot()

Out[66]:

<matplotlib.axes._subplots.AxesSubplot at 0x557c2e8>

In [68]:

# Mark invalid values as missing
ri.stop_duration.value_counts()

Out[68]:

0-15 Min     69543
16-30 Min    13635
30+ Min       3228
1                1
2                1
Name: stop_duration, dtype: int64

In [73]:

ri[(ri.stop_duration=='1')|(ri.stop_duration=='2')].stop_duration='NaN'

C:\Anaconda3\lib\site-packages\pandas\core\generic.py:4401: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value

In [74]:

ri.stop_duration.value_counts()

Out[74]:

0-15 Min     69543
16-30 Min    13635
30+ Min       3228
1                1
2                1
Name: stop_duration, dtype: int64

In [75]:

ri.loc[(ri.stop_duration=='1')|(ri.stop_duration=='2'),'stop_duration']='NaN'

In [76]:

ri.stop_duration.value_counts(dropna=False)

Out[76]:

0-15 Min     69543
16-30 Min    13635
NaN           5333
30+ Min       3228
NaN              2
Name: stop_duration, dtype: int64

In [77]:

# Replace the string 'NaN' with a real np.nan value
import numpy as np
ri.loc[ri.stop_duration == 'NaN', 'stop_duration'] = np.nan

In [79]:

ri.stop_duration.value_counts(dropna=False)

Out[79]:

0-15 Min     69543
16-30 Min    13635
NaN           5335
30+ Min       3228
Name: stop_duration, dtype: int64
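
The sequence above shows two pitfalls: assigning through a filtered copy has no effect on ri, and the string 'NaN' is an ordinary value rather than a missing one. A minimal sketch of the one-step version, on a hypothetical five-row frame (invented data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing the two invalid codes '1' and '2'.
stops = pd.DataFrame({
    "stop_duration": ["0-15 Min", "1", "16-30 Min", "2", "30+ Min"],
})

# The string 'NaN' is not treated as missing by pandas.
string_nan_is_missing = pd.Series(["NaN"]).isna().sum()

# Select rows and column in one .loc call so the assignment hits the original frame,
# and assign a real np.nan so that pandas treats the values as missing.
invalid = stops.stop_duration.isin(["1", "2"])
stops.loc[invalid, "stop_duration"] = np.nan

print(string_nan_is_missing)
print(stops.stop_duration.value_counts(dropna=False))
```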

In [80]:

# Alternative: replace() handles both invalid values in one step
ri['stop_duration'] = ri.stop_duration.replace(['1', '2'], np.nan)

In [118]:

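# Map each stop_duration category to a representative number of minutes
# Series.map accepts a function or a dict-like object that holds the mapping
# It is applied to the whole column at once; here it performs a bulk replacement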
mapping={'0-15 Min':8,'16-30 Min':23,'30+ Min':45}

# map does not modify the original data in place, so store the result in a new column
ri['stop_minutes'] = ri.stop_duration.map(mapping)

In [119]:

# Check the counts of the mapped values
ri.stop_minutes.value_counts()

Out[119]:

8.0     69543
23.0    13635
45.0     3228
Name: stop_minutes, dtype: int64
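
One detail of map that the counts above hint at: any value without a key in the dictionary, including missing values, becomes NaN in the result. A minimal sketch on a hypothetical Series (invented values; the dictionary is the one used above):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value and one invalid code.
stop_duration = pd.Series(
    ["0-15 Min", "16-30 Min", "30+ Min", "0-15 Min", np.nan, "2"]
)

mapping = {"0-15 Min": 8, "16-30 Min": 23, "30+ Min": 45}

# map returns a new Series; stop_duration itself is left unchanged.
stop_minutes = stop_duration.map(mapping)

print(stop_minutes)
```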

In [120]:

ri.groupby('violation_raw').stop_minutes.mean()

Out[120]:

violation_raw
APB                                 20.987342
Call for Service                    22.034669
Equipment/Inspection Violation      11.460345
Motorist Assist/Courtesy            16.916256
Other Traffic Violation             13.900265
Registration Violation              13.745629
Seatbelt Violation                   9.741531
Special Detail/Directed Patrol      15.061100
Speeding                            10.577690
Suspicious Person                   18.750000
Violation of City/Town Ordinance    13.388626
Warrant                             21.400000
Name: stop_minutes, dtype: float64

In [143]:

# agg applies one or more aggregations, such as mean and count, to each group.

# agg used to work only on groupby objects; it now also works on DataFrame and Series objects.
ri.groupby('violation_raw').stop_minutes.agg(['mean','count'])

Out[143]:

	mean	count
violation_raw
APB	20.987342	79
Call for Service	22.034669	1298
Equipment/Inspection Violation	11.460345	11020
Motorist Assist/Courtesy	16.916256	203
Other Traffic Violation	13.900265	16223
Registration Violation	13.745629	3432
Seatbelt Violation	9.741531	2952
Special Detail/Directed Patrol	15.061100	2455
Speeding	10.577690	48462
Suspicious Person	18.750000	56
Violation of City/Town Ordinance	13.388626	211
Warrant	21.400000	15
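
Showing count next to mean matters because a mean over 15 stops is much less reliable than one over 48462. The sketch below uses hypothetical rows (invented data) and also shows agg applied directly to a Series:

```python
import pandas as pd

# Hypothetical sample; None becomes NaN and is ignored by mean and count.
stops = pd.DataFrame({
    "violation_raw": ["Speeding", "Speeding", "Warrant", "Warrant", "Warrant"],
    "stop_minutes": [8.0, 23.0, 45.0, None, 23.0],
})

# One row per group, one column per aggregation.
summary = stops.groupby("violation_raw").stop_minutes.agg(["mean", "count"])

# agg also works without groupby, on a plain Series.
overall = stops.stop_minutes.agg(["mean", "count"])

print(summary)
print(overall)
```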

plot draws a line chart by default

In [165]:

ri.groupby('violation_raw').stop_minutes.mean().plot()

Out[165]:

<matplotlib.axes._subplots.AxesSubplot at 0x10873ef0>

In [167]:

# Switch to a bar chart
ri.groupby('violation_raw').stop_minutes.mean().plot(kind='bar')

Out[167]:

<matplotlib.axes._subplots.AxesSubplot at 0x1092eb38>

In [168]:

ri.groupby('violation_raw').stop_minutes.mean().plot(kind='barh')

Out[168]:

<matplotlib.axes._subplots.AxesSubplot at 0x10a4a5f8>

In [147]:

ri.groupby('violation').driver_age.describe()

Out[147]:

	count	mean	std	min	25%	50%	75%	max
violation
Equipment	11007.0	31.781503	11.400900	16.0	23.0	28.0	38.0	89.0
Moving violation	16164.0	36.120020	13.185805	15.0	25.0	33.0	46.0	99.0
Other	4204.0	39.536870	13.034639	16.0	28.0	39.0	49.0	87.0
Registration/plates	3427.0	32.803035	11.033675	16.0	24.0	30.0	40.0	74.0
Seat belt	2952.0	32.206301	11.213122	17.0	24.0	29.0	38.0	77.0
Speeding	48361.0	33.530097	12.821847	15.0	23.0	30.0	42.0	90.0

In [148]:

ri.driver_age.plot(kind='hist')

Out[148]:

<matplotlib.axes._subplots.AxesSubplot at 0x1003a518>

In [149]:

ri.driver_age.value_counts().sort_index().plot()

Out[149]:

<matplotlib.axes._subplots.AxesSubplot at 0x10088080>

In [150]:

ri.hist('driver_age', by='violation')

Out[150]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000100D8438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010111208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000000001013B898>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010163F28>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000101945F8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010194630>]],
      dtype=object)

In [151]:

ri.hist('driver_age',by='violation',sharex=True)

Out[151]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000102C6C50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000103243C8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010346908>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010370E80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1438>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000103A1470>]],
      dtype=object)

In [152]:

ri.hist('driver_age',by='violation',sharex=True,sharey=True)

Out[152]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000104C4F98>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001059D358>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x00000000105C0748>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000105E9B38>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F28>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000010613F60>]],
      dtype=object)

In [153]:

ri.head()

Out[153]:

	stop_date	stop_time	driver_gender	driver_age_raw	driver_age	driver_race	violation_raw	violation	search_conducted	search_type	stop_outcome	is_arrested	stop_duration	drugs_related_stop	frisk	stop_datetime	stop_time_datetime	stop_minutes	new_age
0	2005-01-02	01:55	M	1985.0	20.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False	NaN	2005-01-02	2019-04-05 01:55:00	8.0	20.0
1	2005-01-18	08:15	M	1965.0	40.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False	NaN	2005-01-18	2019-04-05 08:15:00	8.0	40.0
2	2005-01-23	23:15	M	1972.0	33.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False	NaN	2005-01-23	2019-04-05 23:15:00	8.0	33.0
3	2005-02-20	17:15	M	1986.0	19.0	White	Call for Service	Other	False	NaN	Arrest Driver	True	16-30 Min	False	NaN	2005-02-20	2019-04-05 17:15:00	23.0	19.0
4	2005-03-14	10:00	F	1984.0	21.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False	NaN	2005-03-14	2019-04-05 10:00:00	8.0	21.0

In [154]:

ri.tail()

Out[154]:

	stop_date	stop_time	driver_gender	driver_age_raw	driver_age	driver_race	violation_raw	violation	search_conducted	search_type	stop_outcome	is_arrested	stop_duration	drugs_related_stop	frisk	stop_datetime	stop_time_datetime	stop_minutes	new_age
91736	2015-12-31	20:27	M	1986.0	29.0	White	Speeding	Speeding	False	NaN	Warning	False	0-15 Min	False	NaN	2015-12-31	2019-04-05 20:27:00	8.0	29.0
91737	2015-12-31	20:35	F	1982.0	33.0	White	Equipment/Inspection Violation	Equipment	False	NaN	Warning	False	0-15 Min	False	NaN	2015-12-31	2019-04-05 20:35:00	8.0	33.0
91738	2015-12-31	20:45	M	1992.0	23.0	White	Other Traffic Violation	Moving violation	False	NaN	Warning	False	0-15 Min	False	NaN	2015-12-31	2019-04-05 20:45:00	8.0	23.0
91739	2015-12-31	21:42	M	1993.0	22.0	White	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False	NaN	2015-12-31	2019-04-05 21:42:00	8.0	22.0
91740	2015-12-31	22:46	M	1959.0	56.0	Hispanic	Speeding	Speeding	False	NaN	Citation	False	0-15 Min	False	NaN	2015-12-31	2019-04-05 22:46:00	8.0	56.0

In [155]:

ri['new_age']=ri.stop_datetime.dt.year-ri.driver_age_raw

In [156]:

ri[['driver_age','new_age']].hist()

Out[156]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x00000000107FE7F0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000000001083C2E8>]],
      dtype=object)

In [157]:

ri[['driver_age','new_age']].describe()

Out[157]:

	driver_age	new_age
count	86120.000000	86414.000000
mean	34.011333	39.784294
std	12.738564	110.822145
min	15.000000	-6794.000000
25%	23.000000	24.000000
50%	31.000000	31.000000
75%	43.000000	43.000000
max	99.000000	2015.000000

In [158]:

ri[(ri.new_age<15)|(ri.new_age>99)].shape

Out[158]:

(294, 19)

In [159]:

ri.driver_age_raw.isnull().sum()

Out[159]:

5327

In [160]:

ri.driver_age.isnull().sum()

Out[160]:

5621

In [161]:

5621-5327

Out[161]:

294

In [162]:

ri[(ri.driver_age_raw.notnull())&(ri.driver_age.isnull())].head()

Out[162]:

	stop_date	stop_time	driver_gender	driver_age_raw	driver_age	driver_race	violation_raw	violation	search_conducted	search_type	stop_outcome	is_arrested	stop_duration	drugs_related_stop	frisk	stop_datetime	stop_time_datetime	stop_minutes	new_age
146	2005-10-05	08:50	M	0.0	NaN	White	Other Traffic Violation	Moving violation	False	NaN	Citation	False	0-15 Min	False	NaN	2005-10-05	2019-04-05 08:50:00	8.0	2005.0
281	2005-10-10	12:05	F	0.0	NaN	White	Other Traffic Violation	Moving violation	False	NaN	Warning	False	0-15 Min	False	NaN	2005-10-10	2019-04-05 12:05:00	8.0	2005.0
331	2005-10-12	07:50	M	0.0	NaN	White	Motorist Assist/Courtesy	Other	False	NaN	No Action	False	0-15 Min	False	NaN	2005-10-12	2019-04-05 07:50:00	8.0	2005.0
414	2005-10-17	08:32	M	2005.0	NaN	White	Other Traffic Violation	Moving violation	False	NaN	Citation	False	0-15 Min	False	NaN	2005-10-17	2019-04-05 08:32:00	8.0	0.0
455	2005-10-18	18:30	F	0.0	NaN	White	Speeding	Speeding	False	NaN	Warning	False	0-15 Min	False	NaN	2005-10-18	2019-04-05 18:30:00	8.0	2005.0

In [163]:

ri.loc[(ri.new_age<15)|(ri.new_age>99),'new_age']=np.nan

In [164]:

ri.new_age.equals(ri.driver_age)

Out[164]:

True
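
The last few cells reconstruct how driver_age appears to have been derived: stop year minus birth year, with implausible results set to missing. A minimal sketch of that check on a hypothetical four-row frame (invented data; the 15 to 99 bounds are the ones used above):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two valid birth years and two bad entries (0 and 2005).
stops = pd.DataFrame({
    "stop_date": ["2005-01-02", "2005-10-05", "2005-10-17", "2015-12-31"],
    "driver_age_raw": [1985.0, 0.0, 2005.0, 1959.0],
    "driver_age": [20.0, np.nan, np.nan, 56.0],
})

# Age at the time of the stop = stop year minus birth year.
stop_year = pd.to_datetime(stops.stop_date).dt.year
stops["new_age"] = stop_year - stops.driver_age_raw

# Treat ages outside the 15 to 99 range as data errors.
implausible = (stops.new_age < 15) | (stops.new_age > 99)
stops.loc[implausible, "new_age"] = np.nan

# equals() treats NaN values in the same positions as equal.
matches = stops.new_age.equals(stops.driver_age)

print(implausible.sum())
print(matches)
```

equals is used instead of == because NaN == NaN is False element by element, which would make identical columns with missing values look different.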

Posted 2019-04-05 22:43 by 阿布_alone
