Machine Learning - Regression and Classification 4-1 (Decision Tree Algorithm)

Predicting German credit risk with a decision tree

Main workflow:

  • 1. Import packages
  • 2. Import the dataset
  • 3. Data preprocessing
    • 3.1 Detect and handle missing values
    • 3.2 Handle categorical variables
    • 3.3 Extract the features and the target
    • 3.4 Split into training and test sets
    • 3.5 Feature scaling
  • 4. Build decision tree models with different parameters
    • 4.1 Model 1: build a decision tree
      • 4.1.1 Build the model
      • 4.1.2 Predict on the test set
      • 4.1.3 Evaluate model performance
      • 4.1.4 Plot the tree structure
    • 4.2 Model 2: build a decision tree

1. Import packages
 
In [1]:
# Import packages
import numpy as np
import pandas as pd

 

2. Import the dataset

In [2]:
# Import the dataset
data = pd.read_csv("german_credit_data.csv")
data
Out[2]:
      NO.  Age  Sex     Job  Housing  Saving accounts  Checking account  Credit amount  Duration  Purpose  Risk
0 0 67 male 2 own NaN little 1169 6 radio/TV good
1 1 22 female 2 own little moderate 5951 48 radio/TV bad
2 2 49 male 1 own little NaN 2096 12 education good
3 3 45 male 2 free little little 7882 42 furniture/equipment good
4 4 53 male 2 free little little 4870 24 car bad
... ... ... ... ... ... ... ... ... ... ... ...
995 995 31 female 1 own little NaN 1736 12 furniture/equipment good
996 996 40 male 3 own little little 3857 30 car good
997 997 38 male 2 own little NaN 804 12 radio/TV good
998 998 23 male 2 free little little 1845 45 radio/TV bad
999 999 27 male 2 own moderate moderate 4576 45 car good

1000 rows × 11 columns

 

3. Data preprocessing

3.1 Detect and handle missing values

In [3]:
# Detect missing values
null_df = data.isnull().sum() # number of missing values per column
null_df
Out[3]:
NO.                   0
Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
Risk                  0
dtype: int64
In [4]:
# Fill the two fields that have missing values: Saving accounts and Checking account
for col in ['Saving accounts', 'Checking account']: # handle missing values
    data[col] = data[col].fillna('none') # 'none' means the person has no such bank account
In [5]:
# Re-check: no missing values remain
null_df = data.isnull().sum() 
null_df
Out[5]:
NO.                 0
Age                 0
Sex                 0
Job                 0
Housing             0
Saving accounts     0
Checking account    0
Credit amount       0
Duration            0
Purpose             0
Risk                0
dtype: int64

3.2 Handle categorical variables

In [6]:
# Handle the Job field: it is stored as int64 but is really a categorical code
print(data.dtypes)
NO.                  int64
Age                  int64
Sex                 object
Job                  int64
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object
In [7]:
data['Job'] = data['Job'].astype('object')
In [8]:
print(data.dtypes)
NO.                  int64
Age                  int64
Sex                 object
Job                 object
Housing             object
Saving accounts     object
Checking account    object
Credit amount        int64
Duration             int64
Purpose             object
Risk                object
dtype: object
In [9]:
# One-hot encode the categorical variables; drop_first=True avoids the dummy-variable trap
data = pd.get_dummies(data, drop_first = True)
data
Out[9]:
      NO.  Age  Credit amount  Duration  Sex_male  Job_1  Job_2  Job_3  Housing_own  Housing_rent  ...  Checking account_none  Checking account_rich  Purpose_car  Purpose_domestic appliances  Purpose_education  Purpose_furniture/equipment  Purpose_radio/TV  Purpose_repairs  Purpose_vacation/others  Risk_good
0 0 67 1169 6 1 0 1 0 1 0 ... 0 0 0 0 0 0 1 0 0 1
1 1 22 5951 48 0 0 1 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
2 2 49 2096 12 1 1 0 0 1 0 ... 1 0 0 0 1 0 0 0 0 1
3 3 45 7882 42 1 0 1 0 0 0 ... 0 0 0 0 0 1 0 0 0 1
4 4 53 4870 24 1 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
995 995 31 1736 12 0 1 0 0 1 0 ... 1 0 0 0 0 1 0 0 0 1
996 996 40 3857 30 1 0 0 1 1 0 ... 0 0 1 0 0 0 0 0 0 1
997 997 38 804 12 1 0 1 0 1 0 ... 1 0 0 0 0 0 1 0 0 1
998 998 23 1845 45 1 0 1 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
999 999 27 4576 45 1 0 1 0 1 0 ... 0 0 1 0 0 0 0 0 0 1

1000 rows × 25 columns
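As a sketch of what `get_dummies(drop_first=True)` did above, here is a hypothetical one-column example: the alphabetically first level ("free") is dropped, and the remaining indicator columns still identify every level unambiguously.

```python
import pandas as pd

# Toy frame with a single categorical column (hypothetical values)
toy = pd.DataFrame({"Housing": ["own", "rent", "free", "own"]})

# drop_first=True drops the first category ("free"); a row with both
# indicators at 0 therefore means Housing == "free".
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['Housing_own', 'Housing_rent']
```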

3.3 Extract the features and the target

In [10]:
# Separate the target (Risk_good) from the feature matrix
y = data['Risk_good'].values
data = data.drop(['Risk_good'], axis = 1)
x = data.values

3.4 Split into training and test sets

In [11]:
# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(800, 24)
(200, 24)
(800,)
(200,)
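The split above is unstratified. Since the Risk classes in this dataset are imbalanced (roughly 70/30 good vs. bad), passing `stratify=y` is a common refinement that preserves the class ratio in both halves. A toy sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 70/30 imbalance; stratify preserves that ratio
# exactly in both the training and the test split.
y_demo = np.array([1] * 70 + [0] * 30)
x_demo = np.arange(100).reshape(-1, 1)
_, _, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # 0.7 0.7
```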

3.5 Feature scaling

In [12]:
# Feature scaling: standardize using statistics fit on the training set only
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
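A quick sanity check of what `StandardScaler` does, on toy data: after `fit_transform`, each column has mean approximately 0 and standard deviation 1. (Tree splits compare a feature against a threshold, so scaling does not change a decision tree's predictions; it is harmless here and kept for pipeline uniformity.)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales (hypothetical values)
x_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
scaled = StandardScaler().fit_transform(x_demo)

# Each column is now centered at 0 with unit standard deviation
print(np.allclose(scaled.mean(axis=0), 0.0), np.allclose(scaled.std(axis=0), 1.0))
```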

 

4. Build decision tree models with different parameters

4.1 Model 1: build a decision tree

4.1.1 Build the model

In [23]:
# Build decision tree models with different parameters
# Model 1: decision tree with criterion='entropy', max_depth=5, min_samples_leaf=10
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', max_depth = 5, min_samples_leaf = 10, random_state = 0)
classifier.fit(x_train, y_train)
Out[23]:
DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_leaf=10,
                       random_state=0)
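`criterion='entropy'` means splits are chosen by information gain, where a node's impurity is the entropy of its class distribution, H = -Σ p_k log2(p_k). A minimal sketch with a hypothetical helper (not sklearn's internal code):

```python
import numpy as np

# Hypothetical helper: entropy impurity of a node given its class counts
def node_entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

print(node_entropy([50, 50]))   # 1.0 -> maximally impure 50/50 node
print(node_entropy([100, 0]))   # 0.0 -> pure node
```

A split is scored by how much it reduces this impurity, weighted by child-node sizes; the tree greedily picks the split with the largest reduction.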

4.1.2 Predict on the test set

In [24]:
# Predict on the test set
y_pred = classifier.predict(x_test)

4.1.3 Evaluate model performance

In [25]:
# Evaluate model performance
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.685
In [26]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_pred))
[[ 11  48]
 [ 15 126]]
In [27]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.42      0.19      0.26        59
           1       0.72      0.89      0.80       141

    accuracy                           0.69       200
   macro avg       0.57      0.54      0.53       200
weighted avg       0.64      0.69      0.64       200
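As a sanity check, the report's headline numbers can be recomputed by hand from the confusion matrix above (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, cols = predicted
cm = np.array([[11, 48],
               [15, 126]])

accuracy = np.trace(cm) / cm.sum()       # (11 + 126) / 200
precision_1 = cm[1, 1] / cm[:, 1].sum()  # 126 / (48 + 126)
recall_1 = cm[1, 1] / cm[1, :].sum()     # 126 / 141
print(round(accuracy, 3), round(precision_1, 2), round(recall_1, 2))  # 0.685 0.72 0.89
```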

4.1.4 Plot the tree structure

Setup:

  1. Install the graphviz Python package: pip install graphviz
  2. Install the Graphviz application and add it to your PATH
In [28]:
# Export the tree to a .dot file. Note: export_graphviz returns None when
# out_file is given, so wrapping the call in graphviz.Source fails to render
# in Jupyter even though output.dot is still written. Exporting alone avoids this:
from sklearn import tree
tree.export_graphviz(classifier, out_file='output.dot')
Note: the commands below cannot be run inside Jupyter; run them in a terminal.
In [ ]:
'''
# Convert the .dot file into an image or PDF (run in a terminal)
dot -Tpng output.dot -o output.png # PNG
dot -Tpdf output.dot -o output.pdf # PDF
'''
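If installing the Graphviz binary is inconvenient, sklearn can also render a fitted tree as plain text with `export_text`, with no external program needed. A sketch on the iris toy dataset, since this standalone snippet does not reproduce the notebook's fitted classifier:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree on a toy dataset just to demonstrate the text rendering
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text prints the learned splits as an indented if/else view
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```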

4.2 Model 2: build a decision tree

In [34]:
# Model 2: decision tree with criterion='gini', max_depth=5, min_samples_leaf=10, min_samples_split=10
classifier = DecisionTreeClassifier(criterion = 'gini', max_depth = 5, min_samples_leaf = 10, min_samples_split=10, random_state = 0)
classifier.fit(x_train, y_train)
Out[34]:
DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, min_samples_split=10,
                       random_state=0)
In [35]:
# Predict on the test set
y_pred = classifier.predict(x_test)
In [36]:
# Evaluate model performance
print(accuracy_score(y_test, y_pred))
0.755
In [37]:
print(confusion_matrix(y_test, y_pred))
[[ 27  32]
 [ 17 124]]
In [38]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.61      0.46      0.52        59
           1       0.79      0.88      0.84       141

    accuracy                           0.76       200
   macro avg       0.70      0.67      0.68       200
weighted avg       0.74      0.76      0.74       200
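Hand-picking two settings only samples the hyperparameter space. A more systematic comparison is a cross-validated grid search; the sketch below uses synthetic data, since the notebook's x_train/y_train are not reproduced in this standalone snippet:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (the real notebook would pass x_train, y_train)
x_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Try every combination of these settings with 5-fold cross-validation
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 9],
    "min_samples_leaf": [10, 50],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(x_demo, y_demo)
print(search.best_params_)
```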

 

Conclusion: comparing the two models shows how much the hyperparameters matter. Moving from model 1 (entropy) to model 2 (gini, with min_samples_split=10) raises test accuracy from 0.685 to 0.755 and lifts recall on the minority "bad risk" class from 0.19 to 0.46.

 

posted @ Theext