PyTorch 4.9 Hands-On Kaggle Competition: Predicting House Prices
We will use the dataset collected by Bart de Cock in 2011 \([DeCock, 2011]\), which covers house prices in Ames, Iowa from 2006 to 2010, to predict house prices.
Step 1. Download the dataset
There are two ways to download the dataset:
- Option 1: register a Kaggle account and download directly from the Kaggle site: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (note: this may require a proxy in some regions)
- Option 2: follow Chapter 4, Section 10 of Mu Li's \(Dive into Deep Learning\) and download it with Python code: http://zh.d2l.ai/chapter_multilayer-perceptrons/kaggle-house-price.html#id2 (note: run the download in Jupyter)
Step 2. Load the dataset (the relative paths here depend on where you saved the files!)
import numpy as np
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l
train_data = pd.read_csv("../data/kaggle_house_pred_train.csv")
test_data = pd.read_csv("../data/kaggle_house_pred_test.csv")
train_data.shape, test_data.shape  # dataset sizes: ((1460, 81), (1459, 80))
# Take a look at the data so we can plan the preprocessing
train_data
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal |
1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal |
1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal |
1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal |
1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal |
Step 3. Preprocess the data: drop useless columns, standardize, set missing values to 0
# House prices vary widely, so we standardize the data, much like standardizing to a normal distribution
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
Here concat merges the two datasets. We drop the first column of both the training and the test data because it is just the Id index and carries no predictive signal; we also drop the last column (SalePrice) of the training data so its columns line up with the test data. Train: (1460, 81), test: (1459, 80) → (2919, 79).
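The column-dropping pattern above is easy to see on toy stand-ins for the Kaggle files (the values below are made up for illustration):

```python
import pandas as pd

# Toy stand-ins: the training frame has an extra label column at the end
train = pd.DataFrame({'Id': [1, 2], 'LotArea': [8450, 9600],
                      'Street': ['Pave', 'Pave'], 'SalePrice': [208500, 181500]})
test = pd.DataFrame({'Id': [3], 'LotArea': [11250], 'Street': ['Pave']})

# Drop Id everywhere and SalePrice from train so the columns line up
features = pd.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))
print(features.shape)           # (3, 2): 2 + 1 rows, Id and SalePrice gone
print(list(features.columns))   # ['LotArea', 'Street']
```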
# If the test data were unavailable, the mean and std could be computed from the training data alone
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index  # the names of the numeric columns, e.g. 'MSSubClass', 'LotFrontage', ...
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization every mean is zero, so missing values can safely be set to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
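On a toy numeric column (hypothetical values, not from the real dataset), the standardize-then-fill-zero pattern works like this:

```python
import pandas as pd

df = pd.DataFrame({'LotFrontage': [60.0, 80.0, None, 100.0]})
# Standardize: subtract the mean, divide by the sample std; NaN stays NaN
df = df.apply(lambda x: (x - x.mean()) / x.std())
# mean of [60, 80, 100] is 80 and the sample std is 20, giving -1, 0, NaN, 1
df = df.fillna(0)  # 0 is now the mean, so NaN -> 0 imputes the mean
print(df['LotFrontage'].tolist())  # [-1.0, 0.0, 0.0, 1.0]
```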
# `dummy_na=True` treats 'na' (missing) as a valid feature value and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape # (2919, 331)
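A small example of what get_dummies with dummy_na=True does to a single categorical column (toy data, not the real MSZoning values):

```python
import pandas as pd

df = pd.DataFrame({'MSZoning': ['RL', 'RM', None]})
dummies = pd.get_dummies(df, dummy_na=True)
# One indicator column per observed value, plus one for NaN
print(list(dummies.columns))  # ['MSZoning_RL', 'MSZoning_RM', 'MSZoning_nan']
```

Each row ends up with exactly one indicator set; the missing value in row 2 lights up the `MSZoning_nan` column. This is why the feature count jumps from 79 to 331 above.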
Step 4. Convert the cleaned data to torch tensors
# Extract the NumPy arrays from pandas and convert them to tensors for training.
n_train = train_data.shape[0]  # number of training examples
train_features = torch.tensor(all_features[:n_train].values,dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(
    train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)  # i.e. the last column, SalePrice
Step 5. Define the loss function, K-fold cross-validation, the training function, ...
loss = nn.MSELoss()
in_features = train_features.shape[1]  # preprocessing changed the feature count, so recompute the number of columns
def get_net():
    net = nn.Sequential(nn.Linear(in_features, 1))
    return net
# For house prices we care about relative error (y - ŷ)/y, not absolute error y - ŷ
def log_rmse(net, features, labels):
    # To further stabilize the value when taking the log, clip values below 1 up to 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds),
                           torch.log(labels)))
    return rmse.item()
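As a sanity check, the same clamp-then-log RMSE can be reproduced with plain Python (hypothetical prices, no torch needed):

```python
import math

def log_rmse_plain(preds, labels):
    # Clamp predictions into [1, inf) so the logarithm stays stable
    clipped = [max(p, 1.0) for p in preds]
    sq = [(math.log(p) - math.log(y)) ** 2 for p, y in zip(clipped, labels)]
    return math.sqrt(sum(sq) / len(sq))

# Predicting half or double the true price incurs the same penalty, log(2):
print(round(log_rmse_plain([100000, 400000], [200000, 200000]), 4))  # 0.6931
```

This symmetry between under- and over-prediction is exactly what working in log space buys us.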
# Define the training function
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # Adam is used here; it behaves much like SGD but is less sensitive to the learning rate
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr=learning_rate,
                                 weight_decay=weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
# Define K-fold cross-validation
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k  # integer division -> the size of each fold
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)  # Python's built-in slice object
        X_part, y_part = X[idx, :], y[idx]  # see the Q&A below for details
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)  # stack vertically
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid
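To see how the folds are carved out, here is the same index arithmetic on a tiny example (10 samples, k = 5); `fold_indices` is a hypothetical helper that mirrors the slicing in get_k_fold_data without needing torch:

```python
def fold_indices(k, i, n):
    # Fold i is held out for validation; everything else is training data
    fold_size = n // k
    valid = list(range(i * fold_size, (i + 1) * fold_size))
    train = [j for j in range(n) if j not in valid]
    return train, valid

train_idx, valid_idx = fold_indices(5, 1, 10)
print(valid_idx)  # [2, 3]  (fold i=1 is held out)
print(train_idx)  # [0, 1, 4, 5, 6, 7, 8, 9]
```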
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                     legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k
Step 6. Model selection
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')
fold 1, train log rmse 0.170172, valid log rmse 0.157228
fold 2, train log rmse 0.162814, valid log rmse 0.191097
fold 3, train log rmse 0.163801, valid log rmse 0.168365
fold 4, train log rmse 0.168104, valid log rmse 0.154445
fold 5, train log rmse 0.162938, valid log rmse 0.182995
5-fold validation: avg train log rmse: 0.165566, avg valid log rmse: 0.170826
Q&A
Q1: How does pd.concat work? It concatenates data.
- objs: a list of Series, DataFrame, or Panel objects
- axis: the axis to concatenate along; 0 joins rows, 1 joins columns
- join: how to handle the other axis, inner or outer
import pandas as pd
import numpy as np
# construct some toy data
df1=pd.DataFrame([['A11','A12','A13','A14'],['A21','A22','A23','A24'],['A31','A32','A33','A34'],['A41','A42','A43','A44']],columns=list('ABCD'))
df2=pd.DataFrame([['B11','B12','B13','B14'],['B21','B22','B23','B24'],['B31','B32','B33','B34'],['B41','B42','B43','B44']],columns=list('ABCD'))
df3=pd.DataFrame([['C11','C12','C13','C14'],['C21','C22','C23','C24'],['C31','C32','C33','C34'],['C41','C42','C43','C44']],columns=list('ABCD'))
df4=pd.DataFrame([['D11','D12','D13','D14'],['D21','D22','D23','D24'],['D31','D32','D33','D34']],columns=list('ABCD'))
frames = [df1,df2,df3]
# With default arguments concat stacks vertically (axis=0); here we pass axis=1 to join horizontally
pd.concat(objs=frames, axis=1)
 | A | B | C | D | A | B | C | D | A | B | C | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A11 | A12 | A13 | A14 | B11 | B12 | B13 | B14 | C11 | C12 | C13 | C14 |
1 | A21 | A22 | A23 | A24 | B21 | B22 | B23 | B24 | C21 | C22 | C23 | C24 |
2 | A31 | A32 | A33 | A34 | B31 | B32 | B33 | B34 | C31 | C32 | C33 | C34 |
3 | A41 | A42 | A43 | A44 | B41 | B42 | B43 | B44 | C41 | C42 | C43 | C44 |
# What happens when the row or column counts differ? NaN fills the gaps in the combined table
pd.concat([df1,df4],axis=1)
 | A | B | C | D | A | B | C | D |
---|---|---|---|---|---|---|---|---|
0 | A11 | A12 | A13 | A14 | D11 | D12 | D13 | D14 |
1 | A21 | A22 | A23 | A24 | D21 | D22 | D23 | D24 |
2 | A31 | A32 | A33 | A34 | D31 | D32 | D33 | D34 |
3 | A41 | A42 | A43 | A44 | NaN | NaN | NaN | NaN |
**Q2:** What does torch.clamp() do? It restricts values to an interval: anything below min becomes min, anything above max becomes max.
It clamps every element of the input tensor into the interval [min, max] and returns the result in a new tensor.
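The semantics are easy to mimic with a one-line plain-Python helper (a hypothetical `clamp`, shown only to illustrate the behavior on scalars):

```python
def clamp(x, lo, hi):
    # Values below lo become lo, values above hi become hi, others pass through
    return max(lo, min(x, hi))

print([clamp(v, 1.0, 5.0) for v in [0.2, 3.0, 9.9]])  # [1.0, 3.0, 5.0]
```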
Q3: What is the slice() function? slice() creates a slice object, mainly used to pass slicing bounds as arguments into indexing operations.
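A minimal slice() example, matching how get_k_fold_data uses it:

```python
data = list(range(10))
fold_size, j = 2, 1
# slice(start, stop) builds the same object that data[2:4] uses internally
idx = slice(j * fold_size, (j + 1) * fold_size)
print(data[idx])  # [2, 3]
```

Note that slice() takes positional arguments only; calling it with `start=`/`stop=` keywords raises a TypeError, which is why the code above avoids them.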
Q4: What is the custom log_rmse function doing?
For house prices we cannot use absolute error; we should look at relative error, \(\frac{\hat{y}}{y}\). To keep the computation simple, we work with logarithms: \(|\log y - \log \hat{y}| \leq \delta\) is equivalent to \(e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^{\delta}\). This leads to the following root-mean-squared error between the logarithm of the predicted price and the logarithm of the true price:
\[\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log y_i - \log \hat{y}_i\right)^2}\]
Our function takes the logarithm first and then computes the RMSE, so the predicted prices must first be clamped into a valid interval, which is what torch.clamp() is for.
posted on 2022-01-15 23:44 by YangShusen'