Machine Learning - Regression and Classification 4-1 (Decision Tree Algorithm)

Predicting German credit risk with a decision tree

Main workflow:

  • 1. Import packages
  • 2. Import the dataset
  • 3. Data preprocessing
    • 3.1 Detect and handle missing values
    • 3.2 Handle categorical variables
    • 3.3 Extract the features and the target
    • 3.4 Split into training and test sets
    • 3.5 Feature scaling
  • 4. Build decision tree models with different parameters
    • 4.1 Model 1: build a decision tree
      • 4.1.1 Build the model
      • 4.1.2 Predict on the test set
      • 4.1.3 Evaluate model performance
      • 4.1.4 Plot the tree structure
    • 4.2 Model 2: build a decision tree

1. Import packages
 
In [1]:
# Import packages
import numpy as np
import pandas as pd

 

2. Import the dataset

In [2]:
# Import the dataset
data = pd.read_csv("german_credit_data.csv")
data
Out[2]:
      NO.  Age  Sex     Job  Housing  Saving accounts  Checking account  Credit amount  Duration  Purpose  Risk
0 0 67 male 2 own NaN little 1169 6 radio/TV good
1 1 22 female 2 own little moderate 5951 48 radio/TV bad
2 2 49 male 1 own little NaN 2096 12 education good
3 3 45 male 2 free little little 7882 42 furniture/equipment good
4 4 53 male 2 free little little 4870 24 car bad
... ... ... ... ... ... ... ... ... ... ... ...
995 995 31 female 1 own little NaN 1736 12 furniture/equipment good
996 996 40 male 3 own little little 3857 30 car good
997 997 38 male 2 own little NaN 804 12 radio/TV good
998 998 23 male 2 free little little 1845 45 radio/TV bad
999 999 27 male 2 own moderate moderate 4576 45 car good

1000 rows × 11 columns

 

3. Data preprocessing

3.1 Detect and handle missing values

In [3]:
# Detect missing values
null_df = data.isnull().sum() # number of missing values per column
null_df
Out[3]:
NO.                   0
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64
In [4]:
# Fill the two fields that have missing values: Saving accounts and Checking account
for col in ['Saving accounts', 'Checking account']: # handle missing values
    data[col] = data[col].fillna('none') # 'none' means the person has no such bank account
In [5]:
# Re-check: no missing values remain
null_df = data.isnull().sum() 
null_df
Out[5]:
NO.                 0
Age                 0
Sex                 0
Job                 0
Housing             0
Saving accounts     0
Checking account    0
Credit amount       0
Duration            0
Purpose             0
Risk                0
dtype: int64

3.2 Handle categorical variables

In [6]:
# Handle the Job field: it is stored as int64 but is really a categorical code
print(data.dtypes)
NO.                  int64
Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object
In [7]:
data['Job'] = data['Job'].astype('object')
In [8]:
print(data.dtypes)
NO.                  int64
Age                  int64
Sex                 object
Job                 object
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object
In [9]:
# One-hot encode the categorical variables; drop_first=True avoids the dummy-variable trap
data = pd.get_dummies(data, drop_first = True)
data
Out[9]:
      NO.  Age  Credit amount  Duration  Sex_male  Job_1  Job_2  Job_3  Housing_own  Housing_rent  ...  Checking account_none  Checking account_rich  Purpose_car  Purpose_domestic appliances  Purpose_education  Purpose_furniture/equipment  Purpose_radio/TV  Purpose_repairs  Purpose_vacation/others  Risk_good
0 0 67 1169 6 1 0 1 0 1 0 ... 0 0 0 0 0 0 1 0 0 1
1 1 22 5951 48 0 0 1 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
2 2 49 2096 12 1 1 0 0 1 0 ... 1 0 0 0 1 0 0 0 0 1
3 3 45 7882 42 1 0 1 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
4 4 53 4870 24 1 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 995 31 1736 12 0 1 0 0 1 0 ... 1 0 0 0 0 1 0 0 0 1
996 996 40 3857 30 1 0 0 1 1 0 ... 0 0 1 0 0 0 0 0 0 1
997 997 38 804 12 1 0 1 0 1 0 ... 1 0 0 0 0 0 1 0 0 1
998 998 23 1845 45 1 0 1 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
999 999 27 4576 45 1 0 1 0 1 0 ... 0 0 1 0 0 0 0 0 0 1

1000 rows × 25 columns
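As a sketch of what `get_dummies(drop_first=True)` did above, here is a hypothetical one-column example: the alphabetically first level ("free") is dropped, and the remaining indicator columns still identify every level unambiguously.

```python
import pandas as pd

# Toy frame with a single categorical column (hypothetical values)
toy = pd.DataFrame({"Housing": ["own", "rent", "free", "own"]})

# drop_first=True drops the first category ("free"); a row with both
# indicators at 0 therefore means Housing == "free".
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['Housing_own', 'Housing_rent']
```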

3.3 Extract the features and the target

In [10]:
# Separate the target (Risk_good) from the feature matrix
y = data['Risk_good'].values
data = data.drop(['Risk_good'], axis = 1)
x = data.values

3.4 Split into training and test sets

In [11]:
# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(800, 24)
(200, 24)
(800,)
(200,)
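The split above is unstratified. Since the Risk classes in this dataset are imbalanced (roughly 70/30 good vs. bad), passing `stratify=y` is a common refinement that preserves the class ratio in both halves. A toy sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 imbalance; stratify preserves that ratio
# exactly in both the training and the test split.
y_demo = np.array([1] * 70 + [0] * 30)
x_demo = np.arange(100).reshape(-1, 1)
_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # 0.7 0.7
```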

3.5 Feature scaling

In [12]:
# Feature scaling: standardize using statistics fit on the training set only
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
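A quick sanity check of what `StandardScaler` does, on toy data: after `fit_transform`, each column has mean approximately 0 and standard deviation 1. (Tree splits compare a feature against a threshold, so scaling does not change a decision tree's predictions; it is harmless here and kept for pipeline uniformity.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales (hypothetical values)
x_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
scaled = StandardScaler().fit_transform(x_demo)

# Each column is now centered at 0 with unit standard deviation
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
```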

 

4. Build decision tree models with different parameters

4.1 Model 1: build a decision tree

4.1.1 Build the model

In [23]:
# Build decision tree models with different parameters
# Model 1: decision tree with criterion='entropy', max_depth=5, min_samples_leaf=10
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, min_samples_leaf = 10, random_state = 0)
classifier.fit(x_train, y_train)
Out[23]:
DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=10,
                       random_state=0)
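`criterion='entropy'` means splits are chosen by information gain, where a node's impurity is the entropy of its class distribution, H = -Σ p_k log2(p_k). A minimal sketch with a hypothetical helper (not sklearn's internal code):

```python
import numpy as np

# Hypothetical helper: entropy impurity of a node given its class counts
def node_entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

print(node_entropy([50, 50]))   # 1.0 -> maximally impure 50/50 node
print(node_entropy([100, 0]))   # 0.0 -> pure node
```

A split is scored by how much it reduces this impurity, weighted by child-node sizes; the tree greedily picks the split with the largest reduction.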

4.1.2 Predict on the test set

In [24]:
# Predict on the test set
y_pred = classifier.predict(x_test)

4.1.3 Evaluate model performance

In [25]:
# Evaluate model performance
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.685
In [26]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
[[ 11  48]
 [ 15 126]]
In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.42      0.19      0.26        59
           1       0.72      0.89      0.80       141

    accuracy                           0.69       200
   macro avg       0.57      0.54      0.53       200
weighted avg       0.64      0.69      0.64       200
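As a sanity check, the report's headline numbers can be recomputed by hand from the confusion matrix above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, cols = predicted
cm = np.array([[11, 48],
               [15, 126]])

accuracy = np.trace(cm) / cm.sum()       # (11 + 126) / 200
precision_1 = cm[1, 1] / cm[:, 1].sum()  # 126 / (48 + 126)
recall_1 = cm[1, 1] / cm[1, :].sum()     # 126 / 141
print(round(accuracy, 3), round(precision_1, 2), round(recall_1, 2))  # 0.685 0.72 0.89
```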

4.1.4 Plot the tree structure

Setup:

  1. Install the graphviz Python package: pip install graphviz
  2. Install the Graphviz application and add it to your PATH
In [28]:
# Export the tree to a .dot file. Note: export_graphviz returns None when
# out_file is given, so wrapping the call in graphviz.Source fails to render
# in Jupyter even though output.dot is still written. Exporting alone avoids this:
from sklearn import tree
tree.export_graphviz(classifier, out_file='output.dot')
Note: the commands below cannot be run inside Jupyter; run them in a terminal.
In [ ]:
'''
# Convert the .dot file into an image or PDF (run in a terminal)
dot -Tpng output.dot -o output.png # PNG
dot -Tpdf output.dot -o output.pdf # PDF
'''
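If installing the Graphviz binary is inconvenient, sklearn can also render a fitted tree as plain text with `export_text`, with no external program needed. A sketch on the iris toy dataset, since this standalone snippet does not reproduce the notebook's fitted classifier:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree on a toy dataset just to demonstrate the text rendering
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text prints the learned splits as an indented if/else view
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```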

4.2 Model 2: build a decision tree

In [34]:
# Model 2: decision tree with criterion='gini', max_depth=5, min_samples_leaf=10, min_samples_split=10
classifier = DecisionTreeClassifier(criterion = 'gini', max_depth = 5, min_samples_leaf = 10, min_samples_split=10, random_state = 0)
classifier.fit(x_train, y_train)
Out[34]:
DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, min_samples_split=10,
                       random_state=0)
In [35]:
# Predict on the test set
y_pred = classifier.predict(x_test)
In [36]:
# Evaluate model performance
print(accuracy_score(y_test, y_pred))
0.755
In [37]:
print(confusion_matrix(y_test, y_pred))
[[ 27  32]
 [ 17 124]]
In [38]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.61      0.46      0.52        59
           1       0.79      0.88      0.84       141

    accuracy                           0.76       200
   macro avg       0.70      0.67      0.68       200
weighted avg       0.74      0.76      0.74       200
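Hand-picking two settings only samples the hyperparameter space. A more systematic comparison is a cross-validated grid search; the sketch below uses synthetic data, since the notebook's x_train/y_train are not reproduced in this standalone snippet:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the real notebook would pass x_train, y_train)
x_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Try every combination of these settings with 5-fold cross-validation
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 9],
    "min_samples_leaf": [10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(x_demo, y_demo)
print(search.best_params_)
```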

 

Conclusion: comparing the two models shows how much the hyperparameters matter. Moving from model 1 (entropy) to model 2 (gini, with min_samples_split=10) raises test accuracy from 0.685 to 0.755 and lifts recall on the minority "bad risk" class from 0.19 to 0.46.

 

posted @ Theext