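Big Data Practice (3): Data Preprocessing for the Portuguese Bank Dataset

Objectives

Preprocess the dataset so that it can be used for subsequent machine learning. This includes handling missing values in several ways, converting variables to numeric types, imputing missing values with a machine learning model, and shuffling and persisting the data.

Requirements

Handle the missing values in the dataset
Convert the non-numeric variables in the dataset
Standardize the dataset
Save the preprocessed dataset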

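Procedure

Variable overview

Bank client information:

1 - age: age (numeric)
2 - job: type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
3 - marital: marital status (categorical: 'divorced', 'married', 'single', 'unknown'). Note: 'divorced' also covers widowed
4 - education: education level (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
5 - default: has credit in default? (categorical: 'no', 'yes', 'unknown')
6 - housing: has a housing loan? (categorical: 'no', 'yes', 'unknown')
7 - loan: has a personal loan? (categorical: 'no', 'yes', 'unknown')
Information related to the last contact:
8 - contact: contact type (categorical: 'cellular' for mobile phone, 'telephone' for landline)
9 - month: month of the last contact in the year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: day of the week of the last contact (categorical: 'mon', 'tue', 'wed', 'thu', 'fri')
11 - duration: duration of the last contact, in seconds (numeric). Important: this attribute strongly affects the output target (for example, if duration=0 then y='no'). However, the duration is not known before a call is made, and once the call ends y is obviously known. This input should therefore only be used for benchmarking and should be discarded if the goal is a realistic predictive model (the call duration is unknown at prediction time).
Other attributes:
12 - campaign: number of contacts made to this client during this campaign (numeric, includes the last contact)
13 - pdays: number of days since the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted)
14 - previous: number of contacts made to this client before this campaign (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success')
Social and economic context attributes:
16 - emp.var.rate: employment variation rate, quarterly indicator (numeric)
17 - cons.price.idx: consumer price index, monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index, monthly indicator (numeric)
19 - euribor3m: Euribor 3-month rate, daily indicator (numeric)
20 - nr.employed: number of employees, quarterly indicator (numeric)
Output variable (target):
21 - y: has the client subscribed to a deposit, that is, did the marketing succeed? (binary: 'yes', 'no')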

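Data preprocessing

1. Loading the data

Load the data and inspect it with head().
To simplify later steps, store the names of the categorical and numeric columns in separate lists: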
```
numberVar=['age',...]
categoryVar = [ ...]
```


import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
df=pd.read_csv("bank-additional-full.csv",sep=';')
df.shape

(41188, 21)

numberVar=['age','duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']
categoryVar=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome','y']
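The two lists above are typed out by hand. As a rough sketch, they could also be derived from the column dtypes; the toy frame and the names `number_var` and `category_var` below are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the bank data (values are illustrative only).
toy = pd.DataFrame({
    "age": [56, 57],
    "job": ["housemaid", "services"],
    "euribor3m": [4.857, 4.857],
    "y": ["no", "no"],
})

# Numeric columns (int and float) versus everything else.
number_var = toy.select_dtypes(include="number").columns.tolist()
category_var = toy.select_dtypes(exclude="number").columns.tolist()
```

Deriving the lists this way keeps them in sync with the file, at the cost of treating any numerically coded category as numeric.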

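2. Handling missing values

The dataset has 20 input features, split into numeric and categorical variables. The earlier data summary shows that the numeric variables (int64 and float64) have no missing entries, while the non-numeric variables may contain unknown values. This subsection asks you to:

Check the proportion of missing values in each variable
Group the variables that have missing values into high, medium, and low missing rates

2.1 Checking for missing values

df.isnull().any() reports no feature that contains NaN.
In this dataset, however, missing values are encoded differently. Most categorical features use unknown to mark a missing value, while poutcome uses nonexistent; among the numeric variables only pdays has missing values, encoded as the number 999. This step asks you to print the missing-value ratio of every categorical variable that has missing values.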

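Check the missing-value ratio of all categorical variables plus pdays. Unlike the Demo, three values (unknown, nonexistent, 999) all count as missing:

# Note: the output recorded below came from a run that tested for the string '999'.
# pdays is an integer column, so that test never matched and pdays is absent from the output.
# By the crosstab in 2.2, 39,673 of 41,188 records (about 96.3%) have pdays = 999.
cols = categoryVar + ['pdays']
total=df.shape[0]
for col in cols:
    v = df[col].value_counts().to_dict()
    if 'unknown' in v:
        unCount = v['unknown']
    elif 'nonexistent' in v:
        unCount = v['nonexistent']
    elif 999 in v:  # integer key, not the string '999'
        unCount = v[999]
    else:
        continue
    print ("%-10s: %5.1f%%"%(col,unCount/total*100))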

job       :   0.8%
marital   :   0.2%
education :   4.2%
default   :  20.9%
housing   :   2.4%
loan      :   2.4%
poutcome  :  86.3%
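The task also asks for a high/medium/low grouping of the missing rates. A minimal sketch of one way to do that follows; the function name `missing_report` and the 50% and 5% thresholds are assumptions chosen for illustration, since the original post does not state cut-offs.

```python
import pandas as pd

def missing_report(df, placeholders=("unknown", "nonexistent", 999)):
    """Share of placeholder values per column, with a coarse level label."""
    rows = []
    for col in df.columns:
        ratio = df[col].isin(list(placeholders)).mean()
        if ratio == 0:
            continue  # column has no placeholder values
        # Illustrative thresholds (assumption): >=50% high, >=5% medium, else low.
        level = "high" if ratio >= 0.5 else "medium" if ratio >= 0.05 else "low"
        rows.append({"column": col, "ratio": ratio, "level": level})
    return pd.DataFrame(rows, columns=["column", "ratio", "level"])

# Toy data, not taken from the bank dataset.
demo = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "default": ["no", "unknown", "no", "no"],
    "pdays": [999, 999, 999, 3],
})
report = missing_report(demo)
```

With thresholds like these, the ratios printed above would put poutcome in the high group, default in the medium group, and job, marital, education, housing, and loan in the low group.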

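2.2 Handling variables with a high missing rate

Visualize pdays with a histogram and analyze it: roughly what range do the non-missing pdays values fall in?
Use a crosstab of pdays and poutcome to examine how the values of the two variables relate, and draw further conclusions from the data

Plot a histogram of the non-missing part of pdays:

dfPdays=df.loc[df.pdays != 999, 'pdays']

Plot dfPdays as a histogram and, together with .value_counts(), work out which time range most of the marketing intervals fall in.

# Histogram of pdays (non-missing values only)
dfPdays = df.loc[df.pdays!=999,'pdays']
plt.hist(dfPdays,bins=30,rwidth=0.8)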

(array([ 15.,  26.,  61., 439., 118.,  46., 412.,  60.,  18.,   0.,  64.,
         52.,  28.,  58.,  36.,  20.,  24.,  11.,   8.,   0.,   7.,   3.,
          1.,   2.,   3.,   0.,   0.,   1.,   1.,   1.]),
 array([ 0. ,  0.9,  1.8,  2.7,  3.6,  4.5,  5.4,  6.3,  7.2,  8.1,  9. ,
         9.9, 10.8, 11.7, 12.6, 13.5, 14.4, 15.3, 16.2, 17.1, 18. , 18.9,
        19.8, 20.7, 21.6, 22.5, 23.4, 24.3, 25.2, 26.1, 27. ]),
 <a list of 30 Patch objects>)

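The bin counts above show that most non-missing pdays values lie between 0 and about 15 days, with peaks at 3 and 6 days. Although pdays and poutcome have many missing values, the records that are not missing still carry useful information. The heatmap analysis earlier in this series found that pdays (-0.31) and poutcome (-0.13) correlate with the marketing outcome more strongly than many other variables, so these columns are kept as they are rather than deleted.

Use a crosstab to examine the relationship between pdays and poutcome. To make it easier to read, round pdays down to a multiple of 5 so that it becomes a time interval (similar to age groups).

pdaysDf = df['pdays'].apply(lambda x: int(x /5 )*5)
pd.crosstab(pdaysDf,df['poutcome']) # display the crosstab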

poutcome	failure	nonexistent	success
pdays
0	6	0	653
5	74	0	526
10	36	0	158
15	22	0	31
20	3	0	3
25	1	0	2
995	4110	35563	0
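The binning step can be sketched on toy data as below. It uses floor division instead of `apply`, which gives the same bins for non-negative integers; the toy values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "pdays": [3, 6, 12, 999, 999, 999],
    "poutcome": ["success", "success", "failure",
                 "nonexistent", "nonexistent", "failure"],
})

# Floor each value to the start of its 5-day bin (6 -> 5, 12 -> 10, 999 -> 995).
pdays_bin = (toy["pdays"] // 5) * 5
table = pd.crosstab(pdays_bin, toy["poutcome"])
```

In the real crosstab above, every nonexistent record sits in the 995 bin (pdays = 999), while 4,110 failure records also have pdays = 999, so the two placeholders do not mark exactly the same set of clients.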

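2.3 Analyzing and handling missing values in default (credit default)

default: 20.9% of its values are missing, so the missing values are analyzed and repaired.
Tasks:

What does the value distribution of default suggest?
Describe the characteristics of the clients whose credit default record is missing. (Take the client-information variables one at a time and visualize each against default.)
Explain the final treatment of default: why are the unknown records merged with the yes records?

Before repairing default, look at how its values are distributed (using value_counts()):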

df['default'].value_counts()

no         32588
unknown     8597
yes            3
Name: default, dtype: int64

Define the following function. The first argument is the DataFrame and the second is the column to compare against default:

def defaultAsso(dataset, col):
    tab = pd.crosstab(dataset['default'],dataset[col]).apply(lambda x: x/x.sum() * 100)
    tab_pct = tab.transpose()
    x = tab_pct.index.values
    plt.figure(figsize=(14,3))
    plt.plot(x, tab_pct['unknown'],color='green', label='unknown')
    plt.plot(x, tab_pct['yes'],color='blue', label='yes')
    plt.plot(x, tab_pct['no'],color='red', label='no')
    plt.legend() 
    plt.xlabel(col)
    plt.ylabel('rate')
    plt.show()

defaultAsso(df,'job')

defaultAsso(df,'education')

defaultAsso(df,'marital')
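The percentage table that defaultAsso plots can also be produced with the `normalize` argument of `pd.crosstab`. The sketch below, on made-up data, compares the two forms; it illustrates the column-wise normalization only and is not a replacement for the plotting function.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "default": ["no", "no", "unknown", "yes", "no", "unknown"],
    "job": ["admin.", "services", "admin.", "services", "services", "admin."],
})

# Manual form used in defaultAsso: divide each column by its own total.
manual = pd.crosstab(toy["default"], toy["job"]).apply(lambda x: x / x.sum() * 100)
# Built-in form: normalize="columns" does the same division.
builtin = pd.crosstab(toy["default"], toy["job"], normalize="columns") * 100
```

Each column of the result sums to 100, so the plotted lines show, for each job, what share of its clients fall into no, yes, and unknown.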

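Age has to be converted into age groups first:

def get_age_group(age):
    if age <30:
        return 2
    elif age>60:
        return 6
    else:
        return age//10
df['ageGroup'] =df['age'].apply(lambda x:get_age_group(x)) # add the age-group column
defaultAsso(df,'ageGroup') # compare default against the age group
# drop() returns a new DataFrame. The result is only displayed here and is not assigned back,
# so df keeps the ageGroup column (it still appears in the df.info() listings later in this post).
# Use df = df.drop('ageGroup', axis=1) to remove it for real.
df.drop('ageGroup',axis=1)
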

	age	job	marital	education	default	housing	loan	contact	month	day_of_week	...	campaign	pdays	previous	poutcome	emp.var.rate	cons.price.idx	cons.conf.idx	euribor3m	nr.employed	y
0	56	housemaid	married	basic.4y	no	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
1	57	services	married	high.school	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
2	37	services	married	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
3	40	admin.	married	basic.6y	no	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
4	56	services	married	high.school	no	no	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
5	45	services	married	basic.9y	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
6	59	admin.	married	professional.course	no	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
7	41	blue-collar	married	unknown	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
8	24	technician	single	professional.course	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
9	25	services	single	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
10	41	blue-collar	married	unknown	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
11	25	services	single	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
12	29	blue-collar	single	high.school	no	no	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
13	57	housemaid	divorced	basic.4y	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
14	35	blue-collar	married	basic.6y	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
15	54	retired	married	basic.9y	unknown	yes	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
16	35	blue-collar	married	basic.6y	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
17	46	blue-collar	married	basic.6y	unknown	yes	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
18	50	blue-collar	married	basic.9y	no	yes	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
19	39	management	single	basic.9y	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
20	30	unemployed	married	high.school	no	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
21	55	blue-collar	married	basic.4y	unknown	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
22	55	retired	single	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
23	41	technician	single	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
24	37	admin.	married	high.school	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
25	35	technician	married	university.degree	no	no	yes	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
26	59	technician	married	unknown	no	yes	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
27	39	self-employed	married	basic.9y	unknown	no	no	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
28	54	technician	single	university.degree	unknown	no	no	telephone	may	mon	...	2	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
29	55	unknown	married	university.degree	unknown	unknown	unknown	telephone	may	mon	...	1	999	0	nonexistent	1.1	93.994	-36.4	4.857	5191.0	no
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
41158	35	technician	divorced	basic.4y	no	no	no	cellular	nov	tue	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.035	4963.6	yes
41159	35	technician	divorced	basic.4y	no	yes	no	cellular	nov	tue	...	1	9	4	success	-1.1	94.767	-50.8	1.035	4963.6	yes
41160	33	admin.	married	university.degree	no	no	no	cellular	nov	tue	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.035	4963.6	yes
41161	33	admin.	married	university.degree	no	yes	no	cellular	nov	tue	...	1	999	1	failure	-1.1	94.767	-50.8	1.035	4963.6	no
41162	60	blue-collar	married	basic.4y	no	yes	no	cellular	nov	tue	...	2	4	1	success	-1.1	94.767	-50.8	1.035	4963.6	no
41163	35	technician	divorced	basic.4y	no	yes	no	cellular	nov	tue	...	3	4	2	success	-1.1	94.767	-50.8	1.035	4963.6	yes
41164	54	admin.	married	professional.course	no	no	no	cellular	nov	tue	...	2	10	1	success	-1.1	94.767	-50.8	1.035	4963.6	yes
41165	38	housemaid	divorced	university.degree	no	no	no	cellular	nov	wed	...	2	999	0	nonexistent	-1.1	94.767	-50.8	1.030	4963.6	yes
41166	32	admin.	married	university.degree	no	no	no	telephone	nov	wed	...	1	999	1	failure	-1.1	94.767	-50.8	1.030	4963.6	yes
41167	32	admin.	married	university.degree	no	yes	no	cellular	nov	wed	...	3	999	0	nonexistent	-1.1	94.767	-50.8	1.030	4963.6	no
41168	38	entrepreneur	married	university.degree	no	no	no	cellular	nov	wed	...	2	999	0	nonexistent	-1.1	94.767	-50.8	1.030	4963.6	no
41169	62	services	married	high.school	no	yes	no	cellular	nov	wed	...	5	999	0	nonexistent	-1.1	94.767	-50.8	1.030	4963.6	no
41170	40	management	divorced	university.degree	no	yes	no	cellular	nov	wed	...	2	999	4	failure	-1.1	94.767	-50.8	1.030	4963.6	no
41171	33	student	married	professional.course	no	yes	no	telephone	nov	thu	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.031	4963.6	yes
41172	31	admin.	single	university.degree	no	yes	no	cellular	nov	thu	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.031	4963.6	yes
41173	62	retired	married	university.degree	no	yes	no	cellular	nov	thu	...	1	999	2	failure	-1.1	94.767	-50.8	1.031	4963.6	yes
41174	62	retired	married	university.degree	no	yes	no	cellular	nov	thu	...	1	1	6	success	-1.1	94.767	-50.8	1.031	4963.6	yes
41175	34	student	single	unknown	no	yes	no	cellular	nov	thu	...	1	999	2	failure	-1.1	94.767	-50.8	1.031	4963.6	no
41176	38	housemaid	divorced	high.school	no	yes	yes	cellular	nov	thu	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.031	4963.6	no
41177	57	retired	married	professional.course	no	yes	no	cellular	nov	thu	...	6	999	0	nonexistent	-1.1	94.767	-50.8	1.031	4963.6	no
41178	62	retired	married	university.degree	no	no	no	cellular	nov	thu	...	2	6	3	success	-1.1	94.767	-50.8	1.031	4963.6	yes
41179	64	retired	divorced	professional.course	no	yes	no	cellular	nov	fri	...	3	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	no
41180	36	admin.	married	university.degree	no	no	no	cellular	nov	fri	...	2	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	no
41181	37	admin.	married	university.degree	no	yes	no	cellular	nov	fri	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	yes
41182	29	unemployed	single	basic.4y	no	yes	no	cellular	nov	fri	...	1	9	1	success	-1.1	94.767	-50.8	1.028	4963.6	no
41183	73	retired	married	professional.course	no	yes	no	cellular	nov	fri	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	yes
41184	46	blue-collar	married	professional.course	no	no	no	cellular	nov	fri	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	no
41185	56	retired	married	university.degree	no	yes	no	cellular	nov	fri	...	2	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	no
41186	44	technician	married	professional.course	no	no	no	cellular	nov	fri	...	1	999	0	nonexistent	-1.1	94.767	-50.8	1.028	4963.6	yes
41187	74	retired	married	professional.course	no	yes	no	cellular	nov	fri	...	3	999	1	failure	-1.1	94.767	-50.8	1.028	4963.6	no

41188 rows × 21 columns

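Based on the analysis above, the unknown records of default are merged with the yes records (use map to send unknown and yes to the same value). Only 3 records are yes, which is too few to support a class of its own. Then check the result with value_counts().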

df['default']=df['default'].map({'unknown':1 ,'yes':1,'no':0})
df['default'].value_counts()

0    32588
1     8600
Name: default, dtype: int64
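One thing to watch with `map`: any value that is missing from the dictionary becomes NaN without raising an error. A small sketch of a guard against that, using invented data:

```python
import pandas as pd

s = pd.Series(["no", "unknown", "yes", "no"])
mapping = {"unknown": 1, "yes": 1, "no": 0}

encoded = s.map(mapping)
# Series.map yields NaN for values not present in the dict, so check first.
assert encoded.notna().all(), "unmapped category found"
encoded = encoded.astype(int)

# A value outside the mapping shows the failure mode.
unmapped = pd.Series(["maybe"]).map(mapping)
```

Running the same check after each `map` call in this post would catch a misspelled category before it turns into a silent missing value.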

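2.4 Handling variables with very few missing values

2.4.1 Deleting records with missing values

job and marital have only a few missing values, in under 1% of the records, so the records where job or marital is unknown are deleted.
After deleting, call value_counts() to confirm that the missing values are really gone. Using job as the first example: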

df.drop(df[df.job == 'unknown'].index,inplace = True,axis=0)
df.job.value_counts()

admin.           10422
blue-collar       9254
technician        6743
services          3969
management        2924
retired           1720
entrepreneur      1456
self-employed     1421
housemaid         1060
unemployed        1014
student            875
Name: job, dtype: int64

df.drop(df[df.marital == 'unknown'].index,inplace = True,axis=0)
df.marital.value_counts()

married     24694
single      11494
divorced     4599
Name: marital, dtype: int64

pd.crosstab(df['job'],df['marital'])

marital	divorced	married	single
job
admin.	1280	5253	3875
blue-collar	728	6687	1825
entrepreneur	179	1071	203
housemaid	161	777	119
management	331	2089	501
retired	348	1274	93
self-employed	133	904	379
services	532	2294	1137
student	9	41	824
technician	774	3670	2287
unemployed	124	634	251

df['housing'].value_counts()

yes        21376
no         18427
unknown      984
Name: housing, dtype: int64

df['loan'].value_counts()

no         33620
yes         6183
unknown      984
Name: loan, dtype: int64

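2.4.2 Handling linked missing values

According to the heatmap, loan is most closely related to housing, with education next. The value counts above also show exactly 984 unknown records in both housing and loan. A crosstab is therefore used to examine the relationship between housing and loan.
Delete the records where housing is missing
Call value_counts() on housing and on loan to confirm that the missing values are gone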

pd.crosstab(df['housing'],df['loan'])
df.drop(df[df.housing == 'unknown'].index,inplace = True,axis=0)
df['housing'].value_counts()
 

yes    21376
no     18427
Name: housing, dtype: int64

df['loan'].value_counts()

no     33620
yes     6183
Name: loan, dtype: int64

pd.crosstab(df['housing'],df['loan'])

loan	no	yes
housing
no	15897	2530
yes	17723	3653
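The outputs above show that deleting the housing unknowns also removed every loan unknown. A quick way to confirm beforehand that the two columns are missing on the same rows is sketched below on toy data; the variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "housing": ["yes", "unknown", "no", "unknown"],
    "loan":    ["no",  "unknown", "yes", "unknown"],
})

housing_missing = toy["housing"] == "unknown"
loan_missing = toy["loan"] == "unknown"

# True when both columns are unknown on exactly the same rows.
same_rows = bool((housing_missing == loan_missing).all())

# Boolean-mask filtering, an alternative to drop(...index...).
cleaned = toy[~housing_missing]
```

When `same_rows` is true, a single filter on housing is enough, which matches what happened with the 984 records in this dataset.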

pd.crosstab(df['job'],df['loan'])

loan	no	yes
job
admin.	8472	1709
blue-collar	7636	1365
entrepreneur	1212	205
housemaid	874	154
management	2411	439
retired	1431	240
self-employed	1182	194
services	3263	599
student	709	142
technician	5596	988
unemployed	834	148

pd.crosstab(df['housing'],df['marital'])

marital	divorced	married	single
housing
no	2086	11273	5068
yes	2392	12837	6147

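That leaves the missing values in education. About 1.5k records are affected, too many to delete outright, so a random forest is used to impute them. This is done in one pass after all variables have been converted to numeric form.

3. Converting categorical variables to numeric

For categorical variables to take part in model computation they must be converted to numbers, that is, encoded. The categorical variables that have not been encoded yet (education, job, contact, housing, loan, and so on; default was already encoded in 2.3) still need to be converted.
Categorical variables fall into binary, ordinal, and unordered (nominal) types, and each type is encoded differently.

3.1 Variables with only two values

Binary encoding: in this dataset y, default, contact, housing, and loan each take only two values, so they can be encoded as 0 and 1. default was already converted to 0 and 1 in an earlier step.
Tasks:

Use map to convert the values of y, contact, housing, and loan to 0 and 1

Use df[['y','default','contact','housing','loan']].head() to confirm that these variables were converted correctly: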

df['y'].value_counts()

no     35316
yes     4487
Name: y, dtype: int64

df['y'] = df['y'].map({'no':0, 'yes':1})
df['contact']=df['contact'].map({'cellular':0,'telephone':1})
df['housing'] = df['housing'].map({"no":0, "yes":1})
df['loan'] = df['loan'].map({"no":0, "yes":1})
df.y.value_counts() # check the target variable; no missing values found

0    35316
1     4487
Name: y, dtype: int64

df[['y','default','contact','housing','loan']].head()

	y	default	contact	housing	loan
0	0	0	1	0	0
1	0	1	1	0	0
2	0	0	1	1	0
3	0	0	1	0	0
4	0	0	1	0	1

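3.2 Encoding ordinal variables

Looking at its values, education can be treated as an ordinal variable ranked by level of schooling. From lowest to highest the order is "illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school", "professional.course", "university.degree", encoded as 1, 2, 3, ... in that order. Because of the missing values, unknown cannot be placed in this ranking. For convenience it is set to 0 for now and corrected later.

After the conversion, call value_counts() to check that education was converted correctly.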

values = ["unknown","illiterate", "basic.4y", "basic.6y", "basic.9y", "high.school",  "professional.course", "university.degree"]
levels = range(0,len(values))
dict_levels = dict(zip(values, levels))
for v in values:
    df.loc[df['education'] == v, 'education'] = dict_levels[v]
df['education'].value_counts()

7    11821
5     9244
4     5856
6     5100
2     4002
3     2204
0     1558
1       18
Name: education, dtype: int64
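The same ordinal encoding can be written with a single `map` call instead of a loop over `loc`. This is a sketch on invented values; `level_of` is a name made up here, and the ordering is the one given in the text above.

```python
import pandas as pd

# Position in this list is the code: unknown -> 0, illiterate -> 1, ...
order = ["unknown", "illiterate", "basic.4y", "basic.6y", "basic.9y",
         "high.school", "professional.course", "university.degree"]
level_of = {name: level for level, name in enumerate(order)}

edu = pd.Series(["basic.4y", "unknown", "university.degree", "illiterate"])
encoded = edu.map(level_of).astype(int)
```

Converting with `astype(int)` at the end makes the integer dtype explicit, which matters because the column goes into scikit-learn as a target in the next section.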

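3.3 Converting unordered categorical variables to dummy variables

From the variable descriptions above, job, marital, poutcome, month, and day_of_week can be treated as unordered categorical variables. Although month and day_of_week are ordered in time, they have no natural order with respect to the target variable. Unordered categorical variables can be handled with one-hot encoding.
One-hot encoding uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time.
How one-hot encoding converts a variable: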
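As a small illustration of the conversion, the sketch below applies `pd.get_dummies` to a toy marital column: each category becomes its own indicator column, and every row has exactly one indicator set. The data is invented.

```python
import pandas as pd

toy = pd.DataFrame({"marital": ["married", "single", "divorced", "single"]})

# One indicator column per category, named <column>_<value>.
encoded = pd.get_dummies(toy, columns=["marital"])
```

Depending on the pandas version the indicator columns are stored as uint8 or bool; both work as 0/1 inputs for the models used later.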

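Tasks:

Convert the unordered categorical variables in this dataset (job, marital, poutcome, month, day_of_week) to dummy variables (one-hot encoding)
Call df.info() to see how the variables changed after the conversion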

df = pd.get_dummies(df, columns = ['job','marital','poutcome','month','day_of_week'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39803 entries, 0 to 41187
Data columns (total 49 columns):
age                     39803 non-null int64
education               39803 non-null int64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null int64
campaign                39803 non-null int64
pdays                   39803 non-null int64
previous                39803 non-null int64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(5), int64(12), uint8(32)
memory usage: 6.7 MB

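4. Imputing missing values with a random forest

The missing values of education are predicted with machine learning. The idea is to use the values of the other variables to predict the most likely value of each missing entry.
Steps:

Split the dataset into a training set and a test set. Records where education is known go into the training set; records where education is missing go into the test set. education is the prediction target (note that this differs from the dataset's overall target, which is whether the marketing succeeded)
Train a model on the training set and apply the result to the test set

Parameters:

trainX: input variables of the training set
trainY: target values of the training set
testX: input variables of the test set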
from sklearn.ensemble import RandomForestClassifier
def train_predict_unknown(trainX, trainY, testX):
    forest = RandomForestClassifier(n_estimators=100)
    forest = forest.fit(trainX, trainY)
    test_predictY = forest.predict(testX).astype(int)
    return pd.DataFrame(test_predictY,index=testX.index)

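# Records with a known education value form the training set; records where it is unknown (equal to 0) form the test set
test_data = df[df['education'] == 0].copy() # copy so the column assignment later does not act on a slice of df
train_data = df[df['education'] != 0] # education not equal to 0: training set
# education is the target; split the training set into a target and an input DataFrame
trainY =train_data['education'] # the education column becomes trainY
trainX = train_data.drop('education', axis=1)  # inputs: train_data without the education column
testX =test_data.drop('education', axis=1) # inputs: test_data without the education column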

Predict the missing education values with the model:

test_data['education'] = train_predict_unknown(trainX, trainY, testX)

Use value_counts to look at education in test_data and confirm that every missing value was filled in:

test_data['education'].value_counts()

7    446
5    383
2    261
4    256
6    165
3     47
Name: education, dtype: int64
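The split, fit, predict, and write-back steps can be seen end to end in the self-contained sketch below. It uses synthetic data and assumed settings (`n_estimators=50`, `random_state=0`, features `x1` and `x2`); none of these come from the original notebook, and writing predictions back with `loc` is a variation on the concat approach used in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
# Synthetic ordinal target in 1..3 derived from x1; 0 marks "missing".
toy["education"] = np.digitize(toy["x1"], [-0.5, 0.5]) + 1
toy.loc[toy.index[::10], "education"] = 0   # pretend every 10th value is unknown

known = toy[toy["education"] != 0]
unknown = toy[toy["education"] == 0]

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(known.drop(columns="education"), known["education"])

# Write predictions back by index label, so the original row order is kept.
toy.loc[unknown.index, "education"] = model.predict(unknown.drop(columns="education"))
```

Because the model only ever sees labels 1 to 3 during training, it cannot predict the placeholder 0, which is why the imputed column contains no zeros afterwards.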

Merge the test set and the training set back into a single table:

df = pd.concat([train_data, test_data])
df.shape

(39803, 49)

Check that education in the merged table lies between 1 and 7 (the placeholder 0 no longer appears), and inspect the whole table with df.head()

df['education'].value_counts()
df.head()

	age	education	default	housing	loan	contact	duration	campaign	pdays	...	month_may	day_of_week_mon
0	56	2	0	0	0	1	261	1	999	...	1	1
1	57	5	1	0	0	1	149	1	999	...	1	1
2	37	5	0	1	0	1	226	1	999	...	1	1
3	40	3	0	0	0	1	151	1	999	...	1	1
4	56	5	0	0	1	1	307	1	999	...	1	1

5 rows × 49 columns

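5. Standardizing the numeric variables

Not every algorithm needs standardized numeric variables. Some are sensitive to it, for example logistic regression, support vector machines, and neural networks, while random forests and decision trees do not need it. To keep the choice of algorithm open, all numeric variables are standardized here.
In this example every numeric variable is standardized. education is an ordinal variable, so it is standardized as well.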

from sklearn.preprocessing import StandardScaler
def scaleColumns(data, cols_to_scale):
    scaler = StandardScaler()
    idx = data.index.values
    for col in cols_to_scale:
        # refit the scaler on each column and write the result back under the same index
        x = scaler.fit_transform(pd.DataFrame(data[col]))
        data[col] = pd.Series(x.ravel(), index=idx)
    return data

df = scaleColumns(df,numberVar+['education'])
df.head()

	age	education	default	housing	loan	contact	duration	campaign	pdays	previous	...	month_may	day_of_week_mon
0	1.539987	-1.925742	0	0	0	1	0.009489	-0.566762	0.194855	-0.349299	...	1	1
1	1.636117	-0.096859	1	0	0	1	-0.422339	-0.566762	0.194855	-0.349299	...	1	1
2	-0.286490	-0.096859	0	1	0	1	-0.125457	-0.566762	0.194855	-0.349299	...	1	1
3	0.001901	-1.316115	0	0	0	1	-0.414628	-0.566762	0.194855	-0.349299	...	1	1
4	1.539987	-0.096859	0	0	1	1	0.186846	-0.566762	0.194855	-0.349299	...	1	1

5 rows × 49 columns
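StandardScaler standardizes each column independently, so the per-column loop can also be written as one call on all columns. The sketch below, on invented data, shows that form and what "standardized" means in numbers: mean 0 and population standard deviation 1 per column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "campaign": [1.0, 1.0, 2.0, 4.0],
    "y": [0, 1, 0, 1],
})
cols = ["age", "campaign"]

# Scale only the listed columns; the 0/1 target is left untouched.
toy[cols] = StandardScaler().fit_transform(toy[cols])
```

Note that the scaler divides by the population standard deviation (ddof=0), so `toy[cols].std()` with pandas' default ddof=1 will read slightly above 1.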

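6. Feature selection

Raw data can have very high dimensionality. The more dimensions there are, the more sparsely the data is spread along each one, which hurts almost every machine learning algorithm (the curse of dimensionality). When effective features cannot be picked by hand, an algorithm such as PCA is needed to reduce the dimensionality so that statistical learning methods can be applied. If a small set of good features can be selected, however, PCA and similar methods are largely unnecessary. In this experiment the features are already fairly representative and not too numerous, so dimensionality reduction should not be needed.
As noted earlier, duration (the length of the last call with the client) is only known once the call has ended. The point of the model is to reduce the staff's workload, so a prediction about whether to contact a client that only becomes available after the call is worthless. duration should therefore not be an input to the prediction model.

Delete the duration column
Use shape and info to check the final number of variables and records

# drop() returns a new DataFrame, so the result must be assigned back.
# The listing below comes from a run where it was not assigned, which is why
# duration and a total of 49 columns still appear there.
df = df.drop(['duration'],axis=1)
df.info()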

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39803 entries, 0 to 41175
Data columns (total 49 columns):
age                     39803 non-null float64
education               39803 non-null float64
default                 39803 non-null int64
housing                 39803 non-null int64
loan                    39803 non-null int64
contact                 39803 non-null int64
duration                39803 non-null float64
campaign                39803 non-null float64
pdays                   39803 non-null float64
previous                39803 non-null float64
emp.var.rate            39803 non-null float64
cons.price.idx          39803 non-null float64
cons.conf.idx           39803 non-null float64
euribor3m               39803 non-null float64
nr.employed             39803 non-null float64
y                       39803 non-null int64
ageGroup                39803 non-null int64
job_admin.              39803 non-null uint8
job_blue-collar         39803 non-null uint8
job_entrepreneur        39803 non-null uint8
job_housemaid           39803 non-null uint8
job_management          39803 non-null uint8
job_retired             39803 non-null uint8
job_self-employed       39803 non-null uint8
job_services            39803 non-null uint8
job_student             39803 non-null uint8
job_technician          39803 non-null uint8
job_unemployed          39803 non-null uint8
marital_divorced        39803 non-null uint8
marital_married         39803 non-null uint8
marital_single          39803 non-null uint8
poutcome_failure        39803 non-null uint8
poutcome_nonexistent    39803 non-null uint8
poutcome_success        39803 non-null uint8
month_apr               39803 non-null uint8
month_aug               39803 non-null uint8
month_dec               39803 non-null uint8
month_jul               39803 non-null uint8
month_jun               39803 non-null uint8
month_mar               39803 non-null uint8
month_may               39803 non-null uint8
month_nov               39803 non-null uint8
month_oct               39803 non-null uint8
month_sep               39803 non-null uint8
day_of_week_fri         39803 non-null uint8
day_of_week_mon         39803 non-null uint8
day_of_week_thu         39803 non-null uint8
day_of_week_tue         39803 non-null uint8
day_of_week_wed         39803 non-null uint8
dtypes: float64(11), int64(6), uint8(32)
memory usage: 6.7 MB
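The listing above still contains duration because `DataFrame.drop` does not modify the frame it is called on unless the result is assigned (or `inplace=True` is passed). A minimal sketch of the difference, on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"age": [30, 40], "duration": [261, 149], "y": [0, 1]})

toy.drop(["duration"], axis=1)          # result discarded: toy is unchanged
unchanged_cols = list(toy.columns)

toy = toy.drop(columns=["duration"])    # assign the result to keep the change
final_cols = list(toy.columns)
```

The same applies to the ageGroup column from section 2.3, which also remains in the listings for this reason.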

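7. Saving the preprocessed data

Save the preprocessed data so that later machine learning work can load it directly instead of repeating the preprocessing.
Tasks:

The samples in the original dataset are ordered by time, so shuffle them here. Otherwise a later train/test split or batch-wise training would see time-ordered blocks rather than a representative mix of records.
Persist the dataset as a .csv file; index=False means the index is not written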

from sklearn.utils import shuffle
df = shuffle(df)

df.to_csv('bank-preprocess.csv',index=False)
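A sketch of the same shuffle-and-save step with a fixed seed and a round-trip check follows. The `random_state=42` value and the in-memory buffer are assumptions made so the example is reproducible and writes no file; the original code uses neither.

```python
import io
import pandas as pd
from sklearn.utils import shuffle

toy = pd.DataFrame({"age": [30, 40, 50, 60], "y": [0, 1, 0, 1]})

# A fixed seed makes the shuffled order repeatable between runs.
shuffled = shuffle(toy, random_state=42)

# Write without the index, then read back to confirm nothing was lost.
buffer = io.StringIO()
shuffled.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

Because the index is not written, the reloaded frame gets a fresh 0..n-1 index in the shuffled order, so the original time ordering cannot be recovered from the saved file.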

posted @ 2020-06-14 13:12 Mangnolia
