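Machine Learning: Regression 2-4 (Ridge Regression)
Predicting medical charges from multiple factors with ridge regression
1. Import packages
In [1]:
# Import packages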
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Import the dataset
In [2]:
# Load the dataset
data = pd.read_csv('insurance.csv')
data.head()
Out[2]:
3. Data preprocessing
3.1 Check for missing values
In [3]:
# Count the missing values in each column
null_df = data.isnull().sum()
null_df
Out[3]:
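3.2 Label encoding & one-hot encoding
In [4]:
# One-hot encode the categorical columns; drop_first = True drops one level per column,
# so a two-level column becomes a single 0/1 column (equivalent to label encoding)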
data = pd.get_dummies(data, drop_first = True)
data.head()
Out[4]:
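To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy frame. The column names and values are hypothetical stand-ins, not rows from `insurance.csv`, and `dtype=int` is an added assumption so the dummy columns are 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame with hypothetical values (not taken from insurance.csv).
toy = pd.DataFrame({
    "age": [19, 33, 45],
    "sex": ["female", "male", "male"],
    "region": ["southwest", "northwest", "southeast"],
})

# Each categorical column with k levels becomes k - 1 indicator columns;
# the first level in sorted order ("female", "northwest") is the dropped baseline.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['age', 'sex_male', 'region_southeast', 'region_southwest']
```

Dropping one level per column avoids perfectly collinear indicator columns, which matters less for ridge than for ordinary least squares but keeps the coefficients easier to read.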
3.3 Separate the features and the target
In [5]:
# Separate the target (charges) from the feature columns
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values
3.4 Split into training and test sets
In [6]:
# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
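4. Build ridge regression models with different parameters
4.1 Model 1: ridge regression (alpha = 20)
4.1.1 Build the ridge regression model
In [7]:
# Build ridge regression models with different values of alpha
# Model 1: ridge regression (alpha = 20)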
from sklearn.linear_model import Ridge
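# Note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call runs only on older releases
regressor = Ridge(alpha = 20, normalize = True, fit_intercept = True)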
regressor.fit(x_train, y_train)
Out[7]:
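The `normalize = True` argument used above was removed from `Ridge` in scikit-learn 1.2. The following is a sketch of one way to get comparable behaviour on newer releases, assuming the usual migration recipe: standardize the features in a pipeline and multiply alpha by the number of training samples. The data here is synthetic, and the names `coef_original` and `intercept_original` are illustrative, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (stand-in for the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y = X @ np.array([3.0, -0.5, 0.02]) + rng.normal(size=200)

alpha = 0.1
n_samples = X.shape[0]

# normalize=True divided each centered column by its L2 norm, which equals
# sqrt(n_samples) * std. Standardizing and using alpha * n_samples gives the
# same penalized problem.
model = make_pipeline(StandardScaler(), Ridge(alpha=alpha * n_samples))
model.fit(X, y)

scaler = model.named_steps["standardscaler"]
ridge = model.named_steps["ridge"]

# Map the coefficients back to the original feature scale so they can be
# read like regressor.coef_ and regressor.intercept_ in the cells above.
coef_original = ridge.coef_ / scaler.scale_
intercept_original = ridge.intercept_ - np.sum(coef_original * scaler.mean_)
print(coef_original, intercept_original)
```

Because the scaler lives inside the pipeline, `model.predict(x_test)` applies the training-set scaling automatically, so the test set never influences the fitted means or standard deviations.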
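4.1.2 Print the fitted equation
In [8]:
# Print the fitted equation: one "feature * coefficient" term per column, then the intercept
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)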
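The equation-printing cell is repeated unchanged for each model in this notebook. As a sketch, the same logic can be wrapped in a small helper; `format_equation` is a hypothetical name, and the coefficient values in the usage line are made up for illustration.

```python
def format_equation(names, coefs, intercept, target="Charges"):
    """Return 'target = c1 * name1 + c2 * name2 + ... + intercept' as a string."""
    terms = [f"{coef:.2f} * {name}" for name, coef in zip(names, coefs)]
    return f"{target} = " + " + ".join(terms) + f" + {intercept:.2f}"


# Hypothetical coefficients, only to show the output format.
equation = format_equation(["age", "bmi"], [256.1, 321.85], -6916.24)
print(equation)
# Charges = 256.10 * age + 321.85 * bmi + -6916.24
```

With a fitted model this would be called as `format_equation(data.columns, regressor.coef_, regressor.intercept_)`.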
4.1.3 Predict on the test set
In [9]:
# Predict on the test set
y_pred = regressor.predict(x_test)
4.1.4 Compute the model's MSE
In [10]:
# Compute the model's mean squared error (MSE) on the test set
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the ridge regression model with alpha=20:', format(mse_score, ','))
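MSE is the mean of the squared prediction errors, so its unit is the square of the target's unit. The sketch below computes it by hand with NumPy on made-up numbers and also takes the square root (RMSE), which is in the same unit as the charges and is often easier to interpret. The arrays are hypothetical, not model output.

```python
import numpy as np

# Hypothetical true charges and predictions, for illustration only.
y_true = np.array([5000.0, 12000.0, 30000.0, 8000.0])
y_hat = np.array([6000.0, 10000.0, 32000.0, 7000.0])

errors = y_true - y_hat          # -1000, 2000, -2000, 1000
mse = np.mean(errors ** 2)       # same quantity mean_squared_error returns
rmse = np.sqrt(mse)              # back in the unit of the target

print("MSE :", format(mse, ","))
print("RMSE:", format(rmse, ",.2f"))
```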
4.2 Model 2: ridge regression (alpha = 0.1)
In [11]:
# Model 2: ridge regression (alpha = 0.1)
regressor = Ridge(alpha = 0.1, normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[11]:
In [12]:
# Print the fitted equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [13]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [14]:
# Compute the model's MSE on the test set
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the ridge regression model with alpha=0.1:', format(mse_score, ','))
4.3 Model 3: ridge regression (alpha = 0.01)
In [15]:
# Model 3: ridge regression (alpha = 0.01)
regressor = Ridge(alpha = 0.01, normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[15]:
In [16]:
# Print the fitted equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [17]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [18]:
# Compute the model's MSE on the test set
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the ridge regression model with alpha=0.01:', format(mse_score, ','))
4.4 Model 4: ridge regression (alpha = 0.0001)
In [19]:
# Model 4: ridge regression (alpha = 0.0001)
regressor = Ridge(alpha = 0.0001, normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[19]:
In [20]:
# Print the fitted equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
In [21]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [22]:
# Compute the model's MSE on the test set
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the ridge regression model with alpha=0.0001:', format(mse_score, ','))
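The four models differ only in alpha, so the fit, predict, and score steps can be run in one loop. The sketch below does this on synthetic data, because `insurance.csv` is not bundled here, and uses a `StandardScaler` pipeline with alpha multiplied by the training-set size as an assumed stand-in for `normalize = True` on newer scikit-learn releases.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the insurance features and charges (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = X @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

results = {}
for alpha in [20, 0.1, 0.01, 0.0001]:
    # alpha * n_train mimics the penalty strength under the old normalize=True.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha * len(X_train)))
    model.fit(X_train, y_train)
    results[alpha] = mean_squared_error(y_test, model.predict(X_test))

for alpha, mse in results.items():
    print(f"alpha={alpha}: test MSE = {mse:,.4f}")
```

Collecting the scores in a dictionary makes it easy to compare them side by side or to plot MSE against alpha.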
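Conclusion: As the four models above show, the regularization hyperparameter alpha affects the performance of a ridge regression model. Each value of alpha produces different coefficients and a different test-set MSE.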
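Since alpha matters, it is usually chosen by cross-validation on the training data rather than by comparing test-set scores, which would leak test information into the choice. A minimal sketch with scikit-learn's `RidgeCV` follows. The data is synthetic and the alpha grid is an arbitrary assumption, and these alphas apply to standardized features, so they are not on the same scale as the `normalize = True` values used above.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X @ np.array([5.0, -3.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=300)

# Candidate alphas on a log grid; RidgeCV scores each one by cross-validation
# (efficient leave-one-out by default) and keeps the best.
alphas = np.logspace(-4, 2, 13)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
model.fit(X, y)

best_alpha = model.named_steps["ridgecv"].alpha_
print("selected alpha:", best_alpha)
```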