“魔镜杯” (Magic Mirror Cup) Risk-Control Algorithm Competition

Competition Overview

PPDai's (拍拍贷) "Magic Mirror" risk-control system evaluates a user's current credit status from an average of 400 data dimensions and assigns each borrower a credit score. On top of that score, combined with the information of each newly posted listing, it predicts the 6-month overdue rate of every listing, giving investors a key basis for their decisions and promoting healthy, efficient internet finance. For the first time, PPDai is opening up rich, real historical data and inviting you to take on the "Magic Mirror" system: using machine learning, can you design a default-prediction algorithm with better predictive accuracy and computational performance?

Competition Rules

Participating teams build a prediction model on the training set and use it to score the test set (the higher the score, the more likely the loan is to default).

Evaluation metric: the competition uses AUC, i.e. the area under the ROC (Receiver Operating Characteristic) curve, which plots the True Positive Rate on the y-axis against the False Positive Rate on the x-axis.
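As a quick illustration, the metric can be computed locally with scikit-learn's roc_auc_score; the y_true and y_score values below are toy placeholders rather than competition data.

# Minimal sketch of the evaluation metric: AUC of predicted scores against true labels
# (y_true / y_score are made-up toy values, not competition data)
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 0, 1]             # ground-truth default labels
y_score = [0.1, 0.4, 0.8, 0.2, 0.6]  # model scores; higher means more likely to default
print(roc_auc_score(y_true, y_score))  # 1.0 here: every defaulter is ranked above every non-defaulter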

Competition Data

The competition releases loan-risk data from China's online lending industry, including the credit-default label (the dependent variable), the basic and derived fields needed for modelling (the independent variables), and the raw web-behaviour data of the users involved. To protect borrower privacy and PPDai's intellectual property, the data fields have been anonymized.

The data are encoded in GBK. The preliminary round provides a training set of 30,000 records and a test set of 20,000 records. The final round adds another 30,000 records for teams to refine their models, plus a new test set of 10,000 records. Each training and test set consists of 3 csv files.

Preliminary-round data download link

Final-round data download link

Data dictionary download link

Master

Each row is one sample (a successfully funded loan); each sample contains 200+ fields of various kinds.

  • Idx: unique key of each loan; it matches the Idx in the other 2 files.
  • UserInfo_*: borrower profile fields
  • WeblogInfo_*: web-behaviour fields
  • Education_Info*: education fields
  • ThirdParty_Info_PeriodN_*: third-party data fields for time period N
  • SocialNetwork_*: social-network fields
  • ListingInfo: date the loan was funded
  • Target: default label (1 = default, 0 = repaid on time). The test set does not contain the target column.

Log_Info

Borrower login records.

  • ListingInfo: date the loan was funded
  • LogInfo1: operation code
  • LogInfo2: operation category
  • LogInfo3: login date
  • Idx: unique key of each loan

Userupdate_Info

Borrower profile-update records.

  • ListingInfo1: date the loan was funded
  • UserupdateInfo1: updated field
  • UserupdateInfo2: update date
  • Idx: unique key of each loan
# Import packages
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style='whitegrid')

import arrow
# Use the arrow library to parse a date string into (timestamp, year, month, day, week number,
# ISO weekday, month stage: beginning/middle/end of month); these are one-hot encoded before being fed to the model.
# Note: written against arrow < 1.0, where .timestamp is a property (arrow >= 1.0 exposes .int_timestamp instead).
def parse_date(date_str, str_format='YYYY/MM/DD'):
    d = arrow.get(date_str, str_format)
    # month stage: 1 = beginning (day 1-10), 2 = middle (11-20), 3 = end (21-31)
    month_stage = int((d.day-1) / 10) + 1
    return (d.timestamp, d.year, d.month, d.day, d.week, d.isoweekday(), month_stage)

# print the column names
def show_cols(df):
    for c in df.columns:
        print(c)

Reading the Data

path = '/Training Set'
# path = './PPD-First-Round-Data-Update/Training Set'
train_master = pd.read_csv('PPD_Training_Master_GBK_3_1_Training_Set.csv', encoding='gbk')
train_loginfo = pd.read_csv('PPD_LogInfo_3_1_Training_Set.csv', encoding='gbk')
train_userinfo = pd.read_csv('PPD_Userupdate_Info_3_1_Training_Set.csv', encoding='gbk')

Data Cleaning

  • Drop columns with a very high proportion of missing values (e.g. more than 20% NaN).
  • Drop rows with many missing values, keeping the number of dropped rows to a small share of the total (here about 800 of the 30,000 rows).
  • Fill the remaining missing values: inspect value_counts to judge whether a column behaves as discrete or continuous, then fill NaNs with the most frequent value or the mean respectively. The choice is made by inspection rather than by whether the dtype is object, which matches the data better (a brief sketch of this inspection follows below).
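As a brief sketch of that inspection (train_master is assumed to be loaded as above; WeblogInfo_2 and WeblogInfo_4 are just two example columns from this dataset), looking at a column's value counts and numeric summary suggests whether the mode or the mean is the more sensible fill value:

# Sketch: inspect a column before choosing a fill strategy
# (assumes train_master has been read in as above; the two columns are examples only)
print(pd.value_counts(train_master['WeblogInfo_2']).head())  # dominated by a few discrete values -> filled with the mode below
print(train_master['WeblogInfo_4'].describe())               # wider numeric spread -> filled with the mean below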
# number of NULL values in each column of train_master
null_sum = train_master.isnull().sum()
# keep only the columns whose NULL count is non-zero
null_sum = null_sum[null_sum!=0]
null_sum_df = DataFrame(null_sum, columns=['num'])
# missing ratio (the training set has 30000 rows)
null_sum_df['ratio'] = null_sum_df['num'] / 30000.0
null_sum_df.sort_values(by='ratio', ascending=False, inplace=True)
print(null_sum_df.head(10))

# drop the columns with severe missingness
train_master.drop(['WeblogInfo_3', 'WeblogInfo_1', 'UserInfo_11', 'UserInfo_13', 'UserInfo_12', 'WeblogInfo_20'],
                  axis=1, inplace=True)
                 num     ratio
WeblogInfo_3   29030  0.967667
WeblogInfo_1   29030  0.967667
UserInfo_11    18909  0.630300
UserInfo_13    18909  0.630300
UserInfo_12    18909  0.630300
WeblogInfo_20   8050  0.268333
WeblogInfo_21   3074  0.102467
WeblogInfo_19   2963  0.098767
WeblogInfo_2    1658  0.055267
WeblogInfo_4    1651  0.055033
# drop the rows with severe missingness
record_nan = train_master.isnull().sum(axis=1).sort_values(ascending=False)
print(record_nan.head())
# drop rows with >= 5 missing values
drop_record_index = [i for i in record_nan.loc[(record_nan>=5)].index]
# before the drop: (30000, 222)
print('before train_master shape {}'.format(train_master.shape))
train_master.drop(drop_record_index, inplace=True)
# after the drop: (29189, 222)
print('after train_master shape {}'.format(train_master.shape))
# len(drop_record_index)
29341    33
18637    31
17386    31
29130    31
29605    31
dtype: int64
before train_master shape (30000, 222)
after train_master shape (29189, 222)
# total number of NaN values before filling
print('before all nan num: {}'.format(train_master.isnull().sum().sum()))

# rows where UserInfo_2 is null: fill UserInfo_2 with the placeholder string '位置地点' (unknown location)
train_master.loc[train_master['UserInfo_2'].isnull(), 'UserInfo_2'] = '位置地点'
# rows where UserInfo_4 is null: fill UserInfo_4 with the placeholder string '位置地点' (unknown location)
train_master.loc[train_master['UserInfo_4'].isnull(), 'UserInfo_4'] = '位置地点'

def fill_nan(f, method):
    if method == 'most':
        # fill with the most frequent value
        common_value = pd.value_counts(train_master[f], ascending=False).index[0]
    else:
        # fill with the mean
        common_value = train_master[f].mean()
    train_master.loc[train_master[f].isnull(), f] = common_value

# the 'most'/'mean' choices below come from inspecting pd.value_counts(train_master[f]) for each column
fill_nan('UserInfo_1', 'most')
fill_nan('UserInfo_3', 'most')
fill_nan('WeblogInfo_2', 'most')
fill_nan('WeblogInfo_4', 'mean')
fill_nan('WeblogInfo_5', 'mean')
fill_nan('WeblogInfo_6', 'mean')
fill_nan('WeblogInfo_19', 'most')
fill_nan('WeblogInfo_21', 'most')

print('after all nan num: {}'.format(train_master.isnull().sum().sum()))
before all nan num: 0
9725
13478
25688
24185
23997
after all nan num: 0

Feature Classification

  • For every column, if the most frequent value accounts for more than a threshold (50%) of the values, convert the column to a binary feature; e.g. [0,1,2,0,0,0,4,0,3] becomes [0,1,1,0,0,0,1,0,1].
  • Split the remaining features into numerical and categorical according to dtype.
  • Within the numerical group, features with no more than 10 unique values are also moved to the categorical group.
ratio_threshold = 0.5
binarized_features = []
binarized_features_most_freq_value = []

# aggregating the means of the third-party features across periods was tried, but the result was not good, so it was dropped
# third_party_features = []

# iterate over all columns; process every column except target as follows
for f in train_master.columns:
    if f in ['target']:
        continue
    # number of non-null values
    not_null_sum = (train_master[f].notnull()).sum()
    # count of the most frequent value
    most_count = pd.value_counts(train_master[f], ascending=False).iloc[0]
    # the most frequent value itself
    most_value = pd.value_counts(train_master[f], ascending=False).index[0]
    # share of the most frequent value among the non-null values
    ratio = most_count / not_null_sum
    # if it exceeds the threshold, mark the column for binarization
    if ratio > ratio_threshold:
        binarized_features.append(f)
        binarized_features_most_freq_value.append(most_value)

# numerical features (non-object dtype, excluding 'Idx', 'target' and the binarized features)
numerical_features = [f for f in train_master.select_dtypes(exclude = ['object']).columns 
                      if f not in(['Idx', 'target']) and f not in binarized_features]

# categorical features (object dtype, excluding 'Idx', 'target' and the binarized features)
categorical_features = [f for f in train_master.select_dtypes(include = ["object"]).columns 
                        if f not in(['Idx', 'target']) and f not in binarized_features]

# for each binarized feature: prefix the name with b_, map the most frequent value to 0 and everything else to 1, then drop the original column
for i in range(len(binarized_features)):
    f = binarized_features[i]
    most_value = binarized_features_most_freq_value[i]
    train_master['b_' + f] = 1
    train_master.loc[train_master[f] == most_value, 'b_' + f] = 0
    train_master.drop([f], axis=1, inplace=True)
feature_unique_count = []
# for each numerical feature, count its distinct non-zero values
for f in numerical_features:
    feature_unique_count.append((np.count_nonzero(train_master[f].unique()), f))
    
# print(sorted(feature_unique_count))

# features with <= 10 such values are moved from numerical to categorical
for c, f in feature_unique_count:
    if c <= 10:
        print('{} moved from numerical to categorical'.format(f))
        numerical_features.remove(f)
        categorical_features.append(f)
[(60, 'WeblogInfo_4'), (59, 'WeblogInfo_6'), (167, 'WeblogInfo_7'), (64, 'WeblogInfo_16'), (103, 'WeblogInfo_17'), (38, 'UserInfo_18'), (273, 'ThirdParty_Info_Period1_1'), (252, 'ThirdParty_Info_Period1_2'), (959, 'ThirdParty_Info_Period1_3'), (916, 'ThirdParty_Info_Period1_4'), (387, 'ThirdParty_Info_Period1_5'), (329, 'ThirdParty_Info_Period1_6'), (1217, 'ThirdParty_Info_Period1_7'), (563, 'ThirdParty_Info_Period1_8'), (111, 'ThirdParty_Info_Period1_11'), (18784, 'ThirdParty_Info_Period1_13'), (17989, 'ThirdParty_Info_Period1_14'), (5073, 'ThirdParty_Info_Period1_15'), (20047, 'ThirdParty_Info_Period1_16'), (14785, 'ThirdParty_Info_Period1_17'), (336, 'ThirdParty_Info_Period2_1'), (298, 'ThirdParty_Info_Period2_2'), (1192, 'ThirdParty_Info_Period2_3'), (1149, 'ThirdParty_Info_Period2_4'), (450, 'ThirdParty_Info_Period2_5'), (431, 'ThirdParty_Info_Period2_6'), (1524, 'ThirdParty_Info_Period2_7'), (715, 'ThirdParty_Info_Period2_8'), (134, 'ThirdParty_Info_Period2_11'), (21685, 'ThirdParty_Info_Period2_13'), (20719, 'ThirdParty_Info_Period2_14'), (6582, 'ThirdParty_Info_Period2_15'), (22385, 'ThirdParty_Info_Period2_16'), (18554, 'ThirdParty_Info_Period2_17'), (339, 'ThirdParty_Info_Period3_1'), (293, 'ThirdParty_Info_Period3_2'), (1172, 'ThirdParty_Info_Period3_3'), (1168, 'ThirdParty_Info_Period3_4'), (453, 'ThirdParty_Info_Period3_5'), (428, 'ThirdParty_Info_Period3_6'), (1511, 'ThirdParty_Info_Period3_7'), (707, 'ThirdParty_Info_Period3_8'), (129, 'ThirdParty_Info_Period3_11'), (21521, 'ThirdParty_Info_Period3_13'), (20571, 'ThirdParty_Info_Period3_14'), (6569, 'ThirdParty_Info_Period3_15'), (22247, 'ThirdParty_Info_Period3_16'), (18311, 'ThirdParty_Info_Period3_17'), (324, 'ThirdParty_Info_Period4_1'), (295, 'ThirdParty_Info_Period4_2'), (1183, 'ThirdParty_Info_Period4_3'), (1143, 'ThirdParty_Info_Period4_4'), (447, 'ThirdParty_Info_Period4_5'), (422, 'ThirdParty_Info_Period4_6'), (1524, 'ThirdParty_Info_Period4_7'), (706, 'ThirdParty_Info_Period4_8'), (130, 'ThirdParty_Info_Period4_11'), (20894, 'ThirdParty_Info_Period4_13'), (20109, 'ThirdParty_Info_Period4_14'), (6469, 'ThirdParty_Info_Period4_15'), (21644, 'ThirdParty_Info_Period4_16'), (17849, 'ThirdParty_Info_Period4_17'), (322, 'ThirdParty_Info_Period5_1'), (284, 'ThirdParty_Info_Period5_2'), (1144, 'ThirdParty_Info_Period5_3'), (1119, 'ThirdParty_Info_Period5_4'), (436, 'ThirdParty_Info_Period5_5'), (401, 'ThirdParty_Info_Period5_6'), (1470, 'ThirdParty_Info_Period5_7'), (685, 'ThirdParty_Info_Period5_8'), (126, 'ThirdParty_Info_Period5_11'), (20010, 'ThirdParty_Info_Period5_13'), (19145, 'ThirdParty_Info_Period5_14'), (6033, 'ThirdParty_Info_Period5_15'), (20723, 'ThirdParty_Info_Period5_16'), (17149, 'ThirdParty_Info_Period5_17'), (312, 'ThirdParty_Info_Period6_1'), (265, 'ThirdParty_Info_Period6_2'), (1074, 'ThirdParty_Info_Period6_3'), (1046, 'ThirdParty_Info_Period6_4'), (414, 'ThirdParty_Info_Period6_5'), (363, 'ThirdParty_Info_Period6_6'), (1411, 'ThirdParty_Info_Period6_7'), (637, 'ThirdParty_Info_Period6_8'), (71, 'ThirdParty_Info_Period6_9'), (15, 'ThirdParty_Info_Period6_10'), (123, 'ThirdParty_Info_Period6_11'), (95, 'ThirdParty_Info_Period6_12'), (16605, 'ThirdParty_Info_Period6_13'), (16170, 'ThirdParty_Info_Period6_14'), (5188, 'ThirdParty_Info_Period6_15'), (17220, 'ThirdParty_Info_Period6_16'), (14553, 'ThirdParty_Info_Period6_17')]

Feature Engineering

Numerical features

  • For every numerical feature, plot its distribution under each target value with a strip plot (with jitter), which is similar to a box plot but more convenient for spotting large-value outliers.
  • Plot density plots of all numerical features; they can all be brought closer to a normal distribution by taking logarithms.
  • After converting to the log scale, a few more outliers can be removed.
melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features])
print(melt.head(50))
print(melt.shape)
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
    target      variable      value
0        0  WeblogInfo_4   1.000000
1        0  WeblogInfo_4   1.000000
2        0  WeblogInfo_4   2.000000
3        0  WeblogInfo_4   3.027468
4        0  WeblogInfo_4   1.000000
5        0  WeblogInfo_4   2.000000
6        1  WeblogInfo_4  13.000000
7        0  WeblogInfo_4  12.000000
8        1  WeblogInfo_4  10.000000
9        0  WeblogInfo_4   1.000000
10       0  WeblogInfo_4   3.000000
11       0  WeblogInfo_4   1.000000
12       0  WeblogInfo_4  11.000000
13       1  WeblogInfo_4   1.000000
14       0  WeblogInfo_4   3.000000
15       0  WeblogInfo_4   2.000000
16       0  WeblogInfo_4   4.000000
17       0  WeblogInfo_4   4.000000
18       1  WeblogInfo_4   1.000000
19       0  WeblogInfo_4   2.000000
20       0  WeblogInfo_4   3.000000
21       0  WeblogInfo_4   3.000000
22       0  WeblogInfo_4   8.000000
23       0  WeblogInfo_4   1.000000
24       0  WeblogInfo_4   1.000000
25       0  WeblogInfo_4   2.000000
26       0  WeblogInfo_4   9.000000
27       0  WeblogInfo_4   2.000000
28       0  WeblogInfo_4   2.000000
29       0  WeblogInfo_4   2.000000
30       0  WeblogInfo_4   3.000000
31       0  WeblogInfo_4   6.000000
32       0  WeblogInfo_4   1.000000
33       0  WeblogInfo_4   3.000000
34       0  WeblogInfo_4   3.027468
35       0  WeblogInfo_4   6.000000
36       0  WeblogInfo_4   9.000000
37       0  WeblogInfo_4   2.000000
38       1  WeblogInfo_4   5.000000
39       0  WeblogInfo_4   2.000000
40       0  WeblogInfo_4   2.000000
41       0  WeblogInfo_4   3.000000
42       0  WeblogInfo_4   3.027468
43       0  WeblogInfo_4  15.000000
44       0  WeblogInfo_4   2.000000
45       0  WeblogInfo_4   3.000000
46       0  WeblogInfo_4   3.000000
47       0  WeblogInfo_4   2.000000
48       0  WeblogInfo_4   3.000000
49       0  WeblogInfo_4   2.000000
(2714577, 3)


E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)





<seaborn.axisgrid.FacetGrid at 0x4491c80860>

[figure: strip plots of each numerical feature against target]

# Based on the seaborn plots above, inspect how feature values are distributed in the positive and negative classes and drop the outliers

print('{} lines before drop'.format(train_master.shape[0]))

train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_1 > 250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period6_2 > 400].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_2 > 250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period6_3 > 2000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_3 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period6_4 > 1500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_4 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 400)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_7 > 2000)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_6 > 1500)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_5 > 1000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1500)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_8 > 1000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_16 > 2000000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_14 > 1000000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_12 > 60)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 120) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_11 > 20) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 200000)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_13 > 150000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_15 > 40000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period6_17 > 130000) & (train_master.target == 0)].index, inplace=True)


train_master.drop(train_master[train_master.ThirdParty_Info_Period5_1 > 500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_2 > 500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 3000) & (train_master.target == 0)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_3 > 2000)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_5 > 500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_4 > 2000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_6 > 700].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_6 > 300) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_7 > 4000)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_8 > 800)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period5_11 > 200)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_13 > 200000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_14 > 150000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_15 > 75000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_16 > 180000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period5_17 > 150000].index, inplace=True)

# go above

train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_1 > 400)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_2 > 350)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_3 > 1500)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_4 > 1600].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_4 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_5 > 500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_6 > 800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period4_6 > 400) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_8 > 1000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_13 > 250000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_14 > 200000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_15 > 70000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_16 > 210000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period4_17 > 160000].index, inplace=True)


train_master.drop(train_master[train_master.ThirdParty_Info_Period3_1 > 400].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_2 > 380].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_3 > 1750].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_4 > 1750].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_4 > 1250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_5 > 600].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_6 > 800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_6 > 400) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period3_7 > 1600) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_8 > 1000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_13 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_14 > 200000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_15 > 80000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_16 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period3_17 > 150000].index, inplace=True)


train_master.drop(train_master[train_master.ThirdParty_Info_Period2_1 > 400].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_1 > 300) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_2 > 400].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_2 > 300) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_3 > 1800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_3 > 1500) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_4 > 1500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_5 > 580].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_6 > 800].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_6 > 400) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_7 > 2100].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_8 > 700) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_11 > 120].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_13 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_14 > 170000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_15 > 80000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period2_15 > 50000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_16 > 300000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period2_17 > 150000].index, inplace=True)


train_master.drop(train_master[train_master.ThirdParty_Info_Period1_1 > 350].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_1 > 200) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_2 > 300].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_2 > 190) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_3 > 1500].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_4 > 1250].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_5 > 400].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_6 > 500].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_6 > 250) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_7 > 1800].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_8 > 720].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_8 > 600) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_11 > 100].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_13 > 200000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_13 > 140000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_14 > 150000].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_15 > 70000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_15 > 30000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_16 > 200000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_16 > 100000) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.ThirdParty_Info_Period1_17 > 100000].index, inplace=True)
train_master.drop(train_master[(train_master.ThirdParty_Info_Period1_17 > 80000) & (train_master.target == 1)].index, inplace=True)

train_master.drop(train_master[train_master.WeblogInfo_4 > 40].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_6 > 40].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_7 > 150].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_16 > 50].index, inplace=True)
train_master.drop(train_master[(train_master.WeblogInfo_16 > 25) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.WeblogInfo_17 > 100].index, inplace=True)
train_master.drop(train_master[(train_master.WeblogInfo_17 > 80) & (train_master.target == 1)].index, inplace=True)
train_master.drop(train_master[train_master.UserInfo_18 < 10].index, inplace=True)

print('{} lines after drop'.format(train_master.shape[0]))
29189 lines before drop
28074 lines after drop
# melt = pd.melt(train_master, id_vars=['target'], value_vars = [f for f in numerical_features if f != 'Idx'])
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.distplot, "value")
E:\Anaconda3\envs\sklearn\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval





<seaborn.axisgrid.FacetGrid at 0x44984f02e8>

[figure: density plots of the numerical features]

# train_master_log = train_master.copy()
numerical_features_log = [f for f in numerical_features if f not in ['Idx']]

# log-transform the numerical features with log1p
for f in numerical_features_log:
    train_master[f + '_log'] = np.log1p(train_master[f])
    train_master.drop([f], axis=1, inplace=True)

E:\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in log1p
from math import inf

(train_master == -inf).sum().sum()
206845
train_master.replace(-inf, -1, inplace=True)
# density plots after the log transform; the distributions should now be closer to normal
melt = pd.melt(train_master, id_vars=['target'], value_vars = [f+'_log' for f in numerical_features])
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.distplot, "value")
<seaborn.axisgrid.FacetGrid at 0x44f45c2470>

[figure: density plots of the log-transformed features]

# strip plots after the log transform, to check for outliers remaining on the log scale
g = sns.FacetGrid(data=melt, col="variable", col_wrap=4, sharex=False, sharey=False)
g.map(sns.stripplot, 'target', 'value', jitter=True, palette="muted")
E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the stripplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)





<seaborn.axisgrid.FacetGrid at 0x44e4270908>

[figure: strip plots of the log-transformed features against target]

Categorical features

melt = pd.melt(train_master, id_vars=['target'], value_vars=[f for f in categorical_features])
g = sns.FacetGrid(melt, col='variable', col_wrap=4, sharex=False, sharey=False)
g.map(sns.countplot, 'value', palette="muted")
E:\Anaconda3\envs\sklearn\lib\site-packages\seaborn\axisgrid.py:715: UserWarning: Using the countplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)





<seaborn.axisgrid.FacetGrid at 0x44e4c3eac8>

[figure: count plots of the categorical features]

Correlation with the Target

target_corr = np.abs(train_master.corr()['target']).sort_values(ascending=False)
target_corr
target                            1.000000
ThirdParty_Info_Period6_5_log     0.139606
ThirdParty_Info_Period6_11_log    0.139083
ThirdParty_Info_Period6_4_log     0.137962
ThirdParty_Info_Period6_7_log     0.135729
ThirdParty_Info_Period6_3_log     0.132310
ThirdParty_Info_Period6_14_log    0.131138
ThirdParty_Info_Period6_8_log     0.130577
ThirdParty_Info_Period6_16_log    0.128451
ThirdParty_Info_Period6_13_log    0.128013
ThirdParty_Info_Period5_5_log     0.126701
ThirdParty_Info_Period6_17_log    0.126456
ThirdParty_Info_Period5_4_log     0.121786
ThirdParty_Info_Period6_10_log    0.121729
ThirdParty_Info_Period6_1_log     0.121112
ThirdParty_Info_Period5_11_log    0.117162
ThirdParty_Info_Period5_7_log     0.114794
ThirdParty_Info_Period6_2_log     0.112041
ThirdParty_Info_Period6_9_log     0.112039
ThirdParty_Info_Period5_14_log    0.111374
ThirdParty_Info_Period5_3_log     0.108039
ThirdParty_Info_Period5_16_log    0.104786
ThirdParty_Info_Period6_12_log    0.104733
ThirdParty_Info_Period5_13_log    0.104688
ThirdParty_Info_Period5_1_log     0.104191
ThirdParty_Info_Period5_8_log     0.102859
ThirdParty_Info_Period4_5_log     0.101329
ThirdParty_Info_Period5_17_log    0.100960
ThirdParty_Info_Period4_4_log     0.094715
ThirdParty_Info_Period5_2_log     0.090261
                                    ...   
ThirdParty_Info_Period4_15_log    0.004560
b_ThirdParty_Info_Period4_12      0.004331
b_WeblogInfo_13                   0.004090
b_SocialNetwork_4                 0.003752
b_SocialNetwork_3                 0.003752
b_SocialNetwork_2                 0.003752
b_SocialNetwork_16                0.003711
b_SocialNetwork_6                 0.003701
b_SocialNetwork_5                 0.003701
b_WeblogInfo_44                   0.003542
WeblogInfo_7_log                  0.003414
b_WeblogInfo_32                   0.002961
WeblogInfo_16_log                 0.002954
b_ThirdParty_Info_Period2_12      0.002925
b_WeblogInfo_29                   0.002550
b_WeblogInfo_41                   0.002522
ThirdParty_Info_Period4_6_log     0.002362
b_WeblogInfo_11                   0.002257
b_WeblogInfo_12                   0.002209
b_WeblogInfo_8                    0.001922
b_WeblogInfo_40                   0.001759
b_WeblogInfo_36                   0.001554
b_WeblogInfo_26                   0.001357
ThirdParty_Info_Period1_3_log     0.000937
b_WeblogInfo_31                   0.000896
b_WeblogInfo_23                   0.000276
ThirdParty_Info_Period1_8_log     0.000194
b_WeblogInfo_38                   0.000077
b_WeblogInfo_10                        NaN
b_WeblogInfo_49                        NaN
Name: target, Length: 215, dtype: float64
# at_home: guessing that UserInfo_2 and UserInfo_8 represent the user's current city of residence and registered (household registration) city, so equality suggests the user still lives in their home town
train_master['at_home'] = np.where(train_master['UserInfo_2']==train_master['UserInfo_8'], 1, 0)
train_master['at_home']
0        1
1        1
2        1
3        1
4        1
5        0
6        0
7        0
9        0
10       1
11       1
12       1
13       1
14       0
15       1
16       0
17       1
18       1
19       1
20       1
21       0
22       0
23       0
24       0
25       1
26       1
27       1
28       1
29       0
30       1
        ..
29970    0
29971    1
29972    1
29973    0
29974    0
29975    1
29976    0
29977    1
29978    1
29979    1
29980    0
29981    0
29982    1
29983    0
29984    0
29985    0
29986    0
29987    1
29988    1
29989    1
29990    0
29991    1
29992    1
29993    1
29994    0
29995    1
29996    1
29997    0
29998    0
29999    1
Name: at_home, Length: 28074, dtype: int32
train_master_ = train_master.copy()
def parse_ListingInfo(date):
    d = parse_date(date, 'YYYY/M/D')
    return Series(d, 
                  index=['ListingInfo_timestamp', 'ListingInfo_year', 'ListingInfo_month',
                           'ListingInfo_day', 'ListingInfo_week', 'ListingInfo_isoweekday', 'ListingInfo_month_stage'], 
                  dtype=np.int32)

ListingInfo_parsed = train_master_['ListingInfo'].apply(parse_ListingInfo)
print('before train_master_ shape {}'.format(train_master_.shape))
train_master_ = train_master_.merge(ListingInfo_parsed, how='left', left_index=True, right_index=True)
print('after train_master_ shape {}'.format(train_master_.shape))
before train_master_ shape (28074, 223)
after train_master_ shape (28074, 230)

train_loginfo: borrower login records

  • Group by Idx and extract the number of records, the number of distinct LogInfo1 codes, the number of active days, and the date span.
def loginfo_aggr(group):
    # number of rows in the group
    loginfo_num = group.shape[0]
    # number of distinct operation codes (LogInfo1)
    loginfo_LogInfo1_unique_num = group['LogInfo1'].unique().shape[0]
    # number of distinct login dates (active days)
    loginfo_active_day_num = group['LogInfo3'].unique().shape[0]
    # parse the earliest login date
    min_day = parse_date(np.min(group['LogInfo3']), str_format='YYYY-MM-DD')
    # parse the latest login date
    max_day = parse_date(np.max(group['LogInfo3']), str_format='YYYY-MM-DD')
    # number of days between the earliest and latest login
    gap_day = round((max_day[0] - min_day[0]) / 86400)

    indexes = {
        'loginfo_num': loginfo_num, 
        'loginfo_LogInfo1_unique_num': loginfo_LogInfo1_unique_num, 
        'loginfo_active_day_num': loginfo_active_day_num, 
        'loginfo_gap_day': gap_day, 
        'loginfo_last_day_timestamp': max_day[0]
    }
    
    # TODO every individual LogInfo1,LogInfo2 count

    def sub_aggr_loginfo(sub_group):
        return sub_group.shape[0]

    sub_group = group.groupby(by=['LogInfo1', 'LogInfo2']).apply(sub_aggr_loginfo)
    indexes['loginfo_LogInfo12_unique_num'] = sub_group.shape[0]
    return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])
    
train_loginfo_grouped = train_loginfo.groupby(by=['Idx']).apply(loginfo_aggr)
train_loginfo_grouped.head()
     loginfo_num  loginfo_LogInfo1_unique_num  loginfo_active_day_num  loginfo_gap_day  loginfo_last_day_timestamp  loginfo_LogInfo12_unique_num
Idx
3             26                            4                       8               63                  1383264000                             9
5             11                            6                       4               13                  1383696000                             8
8            125                            7                      13               12                  1383696000                            11
12           199                            8                      11              328                  1383264000                            14
16            15                            4                       7                8                  1383523200                             6
train_loginfo_grouped.to_csv('train_loginfo_grouped.csv', header=True, index=True)
train_loginfo_grouped = pd.read_csv('train_loginfo_grouped.csv')
train_loginfo_grouped.head()
   Idx  loginfo_num  loginfo_LogInfo1_unique_num  loginfo_active_day_num  loginfo_gap_day  loginfo_last_day_timestamp  loginfo_LogInfo12_unique_num
0    3           26                            4                       8               63                  1383264000                             9
1    5           11                            6                       4               13                  1383696000                             8
2    8          125                            7                      13               12                  1383696000                            11
3   12          199                            8                      11              328                  1383264000                            14
4   16           15                            4                       7                8                  1383523200                             6

train_userinfo: borrower profile updates

  • Group by Idx and extract the number of records, the number of distinct UserupdateInfo1 values, the number of distinct UserupdateInfo2 dates, and the date span, plus a count for each individual UserupdateInfo1 update type.
def userinfo_aggr(group):
    op_columns = ['_EducationId', '_HasBuyCar', '_LastUpdateDate',
       '_MarriageStatusId', '_MobilePhone', '_QQ', '_ResidenceAddress',
       '_ResidencePhone', '_ResidenceTypeId', '_ResidenceYears', '_age',
       '_educationId', '_gender', '_hasBuyCar', '_idNumber',
       '_lastUpdateDate', '_marriageStatusId', '_mobilePhone', '_qQ',
       '_realName', '_regStepId', '_residenceAddress', '_residencePhone',
       '_residenceTypeId', '_residenceYears', '_IsCash', '_CompanyPhone',
       '_IdNumber', '_Phone', '_RealName', '_CompanyName', '_Age',
       '_Gender', '_OtherWebShopType', '_turnover', '_WebShopTypeId',
       '_RelationshipId', '_CompanyAddress', '_Department',
       '_flag_UCtoBcp', '_flag_UCtoPVR', '_WorkYears', '_ByUserId',
       '_DormitoryPhone', '_IncomeFrom', '_CompanyTypeId',
       '_CompanySizeId', '_companyTypeId', '_department',
       '_companyAddress', '_workYears', '_contactId', '_creationDate',
       '_flag_UCtoBCP', '_orderId', '_phone', '_relationshipId', '_userId',
       '_companyName', '_companyPhone', '_isCash', '_BussinessAddress',
       '_webShopUrl', '_WebShopUrl', '_SchoolName', '_HasBusinessLicense',
       '_dormitoryPhone', '_incomeFrom', '_schoolName', '_NickName',
       '_CreationDate', '_CityId', '_DistrictId', '_ProvinceId',
       '_GraduateDate', '_GraduateSchool', '_IdAddress', '_companySizeId',
       '_HasPPDaiAccount', '_PhoneType', '_PPDaiAccount', '_SecondEmail',
       '_SecondMobile', '_nickName', '_HasSbOrGjj', '_Position']

    # number of rows in the group
    userinfo_num = group.shape[0]
    # number of distinct updated fields (UserupdateInfo1)
    userinfo_unique_num = group['UserupdateInfo1'].unique().shape[0]
    # number of distinct update dates
    userinfo_active_day_num = group['UserupdateInfo2'].unique().shape[0]
    # parse the earliest update date
    min_day = parse_date(np.min(group['UserupdateInfo2']))
    # parse the latest update date
    max_day = parse_date(np.max(group['UserupdateInfo2']))
    # number of days between the earliest and latest update
    gap_day = round((max_day[0] - min_day[0]) / (86400))

    indexes = {
        'userinfo_num': userinfo_num, 
        'userinfo_unique_num': userinfo_unique_num, 
        'userinfo_active_day_num': userinfo_active_day_num, 
        'userinfo_gap_day': gap_day, 
        'userinfo_last_day_timestamp': max_day[0]
    }
    
    for c in op_columns:
        indexes['userinfo' + c + '_num'] = 0

    def sub_aggr(sub_group):
        return sub_group.shape[0]

    sub_group = group.groupby(by=['UserupdateInfo1']).apply(sub_aggr)
    for c in sub_group.index:
        indexes['userinfo' + c + '_num'] = sub_group.loc[c]
    return Series(data=[indexes[c] for c in indexes], index=[c for c in indexes])
    
train_userinfo_grouped = train_userinfo.groupby(by=['Idx']).apply(userinfo_aggr)
train_userinfo_grouped.head()
userinfo_num userinfo_unique_num userinfo_active_day_num userinfo_gap_day userinfo_last_day_timestamp userinfo_EducationId_num userinfo_HasBuyCar_num userinfo_LastUpdateDate_num userinfo_MarriageStatusId_num userinfo_MobilePhone_num ... userinfo_IdAddress_num userinfo_companySizeId_num userinfo_HasPPDaiAccount_num userinfo_PhoneType_num userinfo_PPDaiAccount_num userinfo_SecondEmail_num userinfo_SecondMobile_num userinfo_nickName_num userinfo_HasSbOrGjj_num userinfo_Position_num
Idx
3 13 11 1 0 1377820800 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
5 13 11 1 0 1382572800 1 1 2 1 2 ... 0 0 0 0 0 0 0 0 0 0
8 14 12 2 10 1383523200 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0
12 14 14 2 298 1380672000 1 1 1 1 0 ... 0 0 0 0 0 0 0 0 0 0
16 13 12 2 9 1383609600 1 1 1 1 2 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 91 columns

train_userinfo_grouped.to_csv('train_userinfo_grouped.csv', header=True, index=True)
train_userinfo_grouped = pd.read_csv('train_userinfo_grouped.csv')
train_userinfo_grouped.head()
Idx userinfo_num userinfo_unique_num userinfo_active_day_num userinfo_gap_day userinfo_last_day_timestamp userinfo_EducationId_num userinfo_HasBuyCar_num userinfo_LastUpdateDate_num userinfo_MarriageStatusId_num ... userinfo_IdAddress_num userinfo_companySizeId_num userinfo_HasPPDaiAccount_num userinfo_PhoneType_num userinfo_PPDaiAccount_num userinfo_SecondEmail_num userinfo_SecondMobile_num userinfo_nickName_num userinfo_HasSbOrGjj_num userinfo_Position_num
0 3 13 11 1 0 1377820800 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
1 5 13 11 1 0 1382572800 1 1 2 1 ... 0 0 0 0 0 0 0 0 0 0
2 8 14 12 2 10 1383523200 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
3 12 14 14 2 298 1380672000 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0
4 16 13 12 2 9 1383609600 1 1 1 1 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 92 columns

print('before merge, train_master shape:{}'.format(train_master_.shape))

# train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_index=True)
# train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_index=True)

train_master_ = train_master_.merge(train_loginfo_grouped, how='left', left_on='Idx', right_on='Idx')
train_master_ = train_master_.merge(train_userinfo_grouped, how='left', left_on='Idx', right_on='Idx')

train_master_.fillna(0, inplace=True)

print('after merge, train_master shape:{}'.format(train_master_.shape))
before merge, train_master shape:(28074, 230)
after merge, train_master shape:(28074, 327)

one-hot encoding features

Do not let get_dummies infer which columns to encode: pandas would automatically pick only the object-dtype columns, but some non-object features are categorical in meaning and also need to be one-hot encoded.
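A tiny illustration of that pitfall (toy data, not from the competition): a numeric code column is left untouched by get_dummies unless it is listed in columns= explicitly.

# Sketch: get_dummies only auto-encodes object columns; numeric categorical codes must be listed explicitly
demo = DataFrame({'city': ['a', 'b', 'a'], 'job_code': [1, 3, 1]})
print(pd.get_dummies(demo).columns.tolist())                                # ['job_code', 'city_a', 'city_b'] - job_code not encoded
print(pd.get_dummies(demo, columns=['city', 'job_code']).columns.tolist())  # both columns one-hot encoded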

drop_columns = ['Idx', 'ListingInfo', 'UserInfo_20',  'UserInfo_19', 'UserInfo_8', 'UserInfo_7', 
                'UserInfo_4','UserInfo_2',
               'ListingInfo_timestamp', 'loginfo_last_day_timestamp', 'userinfo_last_day_timestamp']
train_master_ = train_master_.drop(drop_columns, axis=1)

dummy_columns = categorical_features.copy()
dummy_columns.extend(['ListingInfo_year', 'ListingInfo_month', 'ListingInfo_day', 'ListingInfo_week', 
                      'ListingInfo_isoweekday', 'ListingInfo_month_stage'])
finally_dummy_columns = []

for c in dummy_columns:
    if c not in drop_columns:
        finally_dummy_columns.append(c)

print('before get_dummies train_master_ shape {}'.format(train_master_.shape))
train_master_ = pd.get_dummies(train_master_, columns=finally_dummy_columns)
print('after get_dummies train_master_ shape {}'.format(train_master_.shape))
before get_dummies train_master_ shape (28074, 316)
after get_dummies train_master_ shape (28074, 444)

Normalization

from sklearn.preprocessing import StandardScaler

X_train = train_master_.drop(['target'], axis=1)
X_train = StandardScaler().fit_transform(X_train)
y_train = train_master_['target']
print(X_train.shape, y_train.shape)
E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\preprocessing\data.py:617: DataConversionWarning: Data with input dtype uint8, int32, int64, float64 were all converted to float64 by StandardScaler.
  return self.partial_fit(X, y)


(28074, 443) (28074,)


E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\base.py:462: DataConversionWarning: Data with input dtype uint8, int32, int64, float64 were all converted to float64 by StandardScaler.
  return self.fit(X, **fit_params).transform(X)
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
# from scikitplot import plotters as skplt

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC, LinearSVC

# StratifiedKFold keeps the target distribution consistent across folds; shuffle the data as well
cv = StratifiedKFold(n_splits=3, shuffle=True)

# evaluate AUC, accuracy and recall via cross-validation
def estimate(estimator, name='estimator'):
    auc = cross_val_score(estimator, X_train, y_train, scoring='roc_auc', cv=cv).mean()
    accuracy = cross_val_score(estimator, X_train, y_train, scoring='accuracy', cv=cv).mean()
    recall = cross_val_score(estimator, X_train, y_train, scoring='recall', cv=cv).mean()

    print("{}: auc:{:f}, recall:{:f}, accuracy:{:f}".format(name, auc, recall, accuracy))

#     skplt.plot_learning_curve(estimator, X_train, y_train)
#     plt.show()

#     estimator.fit(X_train, y_train)
#     y_probas = estimator.predict_proba(X_train)
#     skplt.plot_roc_curve(y_true=y_train, y_probas=y_probas)
#     plt.show()
estimate(XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic'), 'XGBClassifier')
estimate(RidgeClassifier(), 'RidgeClassifier')
estimate(LogisticRegression(), 'LogisticRegression')
# estimate(RandomForestClassifier(), 'RandomForestClassifier')
estimate(AdaBoostClassifier(), 'AdaBoostClassifier')
# estimate(SVC(), 'SVC')# too long to wait
# estimate(LinearSVC(), 'LinearSVC')

# XGBClassifier: auc:0.747668, recall:0.000000, accuracy:0.944575
# RidgeClassifier: auc:0.754218, recall:0.000000, accuracy:0.944433
# LogisticRegression: auc:0.758454, recall:0.015424, accuracy:0.942010
# AdaBoostClassifier: auc:0.784086, recall:0.013495, accuracy:0.943791
XGBClassifier: auc:0.755890, recall:0.000000, accuracy:0.944575
RidgeClassifier: auc:0.753939, recall:0.000000, accuracy:0.944575


E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
[the same FutureWarning was printed once per LogisticRegression fit during cross-validation]


LogisticRegression: auc:0.759646, recall:0.022494, accuracy:0.942438
AdaBoostClassifier: auc:0.792333, recall:0.017988, accuracy:0.943827

VotingClassifier

from sklearn.ensemble import VotingClassifier

estimators = []
# estimators.append(('RidgeClassifier', RidgeClassifier()))
estimators.append(('LogisticRegression', LogisticRegression()))
estimators.append(('XGBClassifier', XGBClassifier(learning_rate=0.1, n_estimators=20, objective='binary:logistic')))
estimators.append(('AdaBoostClassifier', AdaBoostClassifier()))
# estimators.append(('RandomForestClassifier', RandomForestClassifier()))

#voting: auc:0.794587, recall:0.000642, accuracy:0.944433

voting = VotingClassifier(estimators = estimators, voting='soft')
estimate(voting, 'voting')
E:\Anaconda3\envs\sklearn\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
[the same FutureWarning was printed once per LogisticRegression fit during cross-validation]


voting: auc:0.790281, recall:0.000642, accuracy:0.944361
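The rules ask for a score on every test-set record; the notebook stops at cross-validation, so the following is only a hypothetical sketch of that last step. It assumes an X_test built by running the test csv files through exactly the same cleaning, feature-engineering, one-hot-encoding and scaling steps as X_train (with the columns aligned), and a test_idx holding the corresponding Idx values; neither is constructed in the code above.

# Hypothetical sketch: score the test set for submission
# (X_test and test_idx are assumed to come from the same preprocessing pipeline as X_train; not built above)
voting.fit(X_train, y_train)
test_score = voting.predict_proba(X_test)[:, 1]   # probability of default, i.e. the required score
submission = DataFrame({'Idx': test_idx, 'score': test_score})
submission.to_csv('submission.csv', header=True, index=False)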