Logistic Regression: A Bank Loan Approval Example

Problem Definition

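This is a loan approval problem. Suppose you are a loan officer at a bank, and customers are applying for loans of various amounts. Each applicant has filled in personal information (provided in datas.txt). Based on that information, you need to build a classification model that decides whether each applicant should be granted a loan.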

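Using the given information, build a classification model and evaluate it. Briefly describe how the model was built, and give a short explanation of each feature.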

Dataset

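Each record contains the following fields: user ID, age, gender, requested loan amount, occupation type, education level, marital status, housing type, household registration (hukou) type, loan purpose, company type, salary, and the loan label (0 = loan rejected, 1 = loan approved). In the code, the fields after the user ID appear as the columns Age, Gender, AppAmount, Occupation, Education, Marital, Property, Residence, LoanUse, Company, Salary, and Label.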

Data preprocessing

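The user ID carries no information for modeling, so it is not used as a feature. Among the fields that describe an applicant, age, requested amount, and salary are continuous values; the rest are discrete (categorical) values.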
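The distinction between continuous and discrete fields can be sketched as follows. This is a minimal illustration, assuming pandas and one-hot encoding for the discrete columns; the toy rows are made up, and `UserId` is a hypothetical name for the ID column (the other column names follow the modeling code).

```python
import pandas as pd

# Toy rows for illustration only; the values are made up and do not come from datas.txt.
# "UserId" is an assumed column name; the remaining names follow the modeling code.
toy = pd.DataFrame({
    "UserId": [1, 2, 3, 4],
    "Age": [25, 40, 33, 51],
    "AppAmount": [10000, 50000, 20000, 30000],
    "Salary": [3000, 8000, 5000, 6000],
    "Occupation": [1, 2, 1, 3],
    "Marital": [0, 1, 1, 0],
    "Label": [0, 1, 1, 0],
})

continuous_cols = ["Age", "AppAmount", "Salary"]
discrete_cols = ["Occupation", "Marital"]

# The user ID is not a feature, and the label is the prediction target.
features = toy.drop(columns=["UserId", "Label"])

# Expand every discrete column into 0/1 indicator columns;
# the continuous columns pass through unchanged.
encoded = pd.get_dummies(features, columns=discrete_cols)

print(encoded.columns.tolist())
```

One-hot encoding is only one option. The modeling code in this post feeds the discrete codes to the model directly, which treats them as ordered numbers.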

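Inspection shows that some records have missing values. These records need to be handled, for example by deleting them outright or by imputing the missing values.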
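The two options can be compared on a small example. This is a sketch on made-up rows, assuming the median for continuous columns and the most frequent value for discrete ones; other imputation rules are equally possible.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value each in three different columns.
toy = pd.DataFrame({
    "Age": [25, np.nan, 33, 51],
    "Salary": [3000, 8000, np.nan, 6000],
    "Education": [1, 2, 2, np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: impute instead of dropping.
filled = toy.copy()
for col in ["Age", "Salary"]:
    # Continuous columns: fill with the column median.
    filled[col] = filled[col].fillna(filled[col].median())
# Discrete column: fill with the most frequent value.
filled["Education"] = filled["Education"].fillna(filled["Education"].mode()[0])

print(len(dropped), "row(s) left after dropping")
print(filled)
```

Dropping is simpler but can discard a large share of the data when missing values are spread across many rows, as in this toy example.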

The Logit Function
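For reference, the logit function in its standard textbook form maps a probability p in (0, 1) to its log-odds, and its inverse is the logistic (sigmoid) function:

```latex
\[
\operatorname{logit}(p) = \ln\frac{p}{1-p}, \qquad 0 < p < 1
\]
\[
\sigma(z) = \operatorname{logit}^{-1}(z) = \frac{1}{1 + e^{-z}}
\]
```

A probability of 0.5 corresponds to log-odds of 0, which is why 0.5 is the natural decision threshold.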

The Logistic Regression
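A minimal sketch of how a logistic regression model turns features into a probability, assuming the standard formulation P(y = 1 | x) = sigmoid(w · x + b). The weights `w`, intercept `b`, and input `x` are made-up values for illustration; they are not fitted on the loan data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability in (0, 1) to its log-odds (inverse of sigmoid)."""
    return np.log(p / (1.0 - p))

# Hypothetical weights and intercept for a model with two features.
w = np.array([0.8, -0.5])
b = 0.1
x = np.array([1.0, 2.0])

# Linear score (the log-odds), then the predicted probability of class 1.
z = np.dot(w, x) + b
p = sigmoid(z)

# Classify with the usual 0.5 threshold.
label = int(p >= 0.5)

print("score:", z, "probability:", p, "label:", label)
```

Here the score is slightly negative, so the probability falls just below 0.5 and the predicted label is 0.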

Model Data

# Logistic regression model
# Classify whether a bank customer should be granted a loan

import pandas
import numpy
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

# Feature columns used by the model (the user ID is deliberately left out)
features = ['Age', 'Gender', 'AppAmount', 'Occupation',
            'Education', 'Marital', 'Property', 'Residence',
            'LoanUse', 'Company', 'Salary']

# Load the data and drop every row that has a missing value
data = pandas.read_csv("datas.csv")
data = data.dropna()

# Randomly shuffle the rows before splitting into training and test sets
admissions = data.loc[numpy.random.permutation(data.index)]

# Train on the first 14968 shuffled rows and test on the remaining rows
num_train = 14968
data_train = admissions[:num_train]
data_test = admissions[num_train:]

# Fit logistic regression on the training set to predict Label from the features
logistic_model = LogisticRegression()
logistic_model.fit(data_train[features], data_train['Label'])

# Print the model's coefficients (one per feature, in the order of `features`)
print(logistic_model.coef_)

# .predict() uses a probability threshold of 0.50
predicted_train = logistic_model.predict(data_train[features])

# The mean of the boolean "prediction == label" array is the accuracy
accuracy_train = (predicted_train == data_train['Label']).mean()
print("Accuracy in Training Set = {s}".format(s=accuracy_train))

# Predict labels for the test set and compute the test accuracy
predicted_test = logistic_model.predict(data_test[features])
accuracy_test = (predicted_test == data_test['Label']).mean()
print("Accuracy in Test Set = {s}".format(s=accuracy_test))

# Predicted probability of Label == 1 for the training and test sets
train_probs = logistic_model.predict_proba(data_train[features])[:, 1]
test_probs = logistic_model.predict_proba(data_test[features])[:, 1]

# Compute AUC for the training set and the test set
auc_train = roc_auc_score(data_train["Label"], train_probs)
auc_test = roc_auc_score(data_test["Label"], test_probs)

# A large gap between training and test AUC suggests overfitting
auc_diff = auc_train - auc_test
print("AUC train = {a}, AUC test = {b}, difference = {d}".format(
    a=auc_train, b=auc_test, d=auc_diff))

# Compute ROC curves; roc_curve returns (false positive rate, true positive rate, thresholds)
roc_train = roc_curve(data_train["Label"], train_probs)
roc_test = roc_curve(data_test["Label"], test_probs)

# Plot true positive rate against false positive rate
plt.plot(roc_train[0], roc_train[1], label="train")
plt.plot(roc_test[0], roc_test[1], label="test")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
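Accuracy alone can hide how a model treats each class, so the predicted probabilities can also be summarized as a confusion matrix. The sketch below uses NumPy only; `y_true` and `probs` are made-up arrays standing in for the test labels and `test_probs`, and the 0.5 threshold matches the default behavior of `.predict()`.

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
probs = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.7, 0.8, 0.1])

# Turn probabilities into 0/1 predictions with a 0.5 threshold.
threshold = 0.5
y_pred = (probs >= threshold).astype(int)

# Confusion matrix counts.
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # approved, and should be
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # approved, but should not be
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # rejected, but should be approved
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # rejected, and should be

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print("TP, FP, FN, TN =", tp, fp, fn, tn)
print("accuracy =", accuracy, "precision =", precision, "recall =", recall)
```

For a lender, false positives (loans approved that should have been rejected) are usually the costlier error, so raising the threshold above 0.5 trades recall for precision.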

 

posted @ 2016-08-29 16:00  Black_Knight