PyTorch 4.9 Hands-On Kaggle Competition: Predicting House Prices


We will use the house-price data collected by Bart de Cock in 2011 [DeCock, 2011], covering home sales in Ames, Iowa between 2006 and 2010, to predict sale prices.

Step 1. Download the dataset

There are two ways to get the dataset:

  1. Register a Kaggle account and download it directly from the Kaggle site: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (access from some regions may require a VPN)
  2. Follow chapter 4, section 10 of Mu Li's *Dive into Deep Learning* and download it with Python code: http://zh.d2l.ai/chapter_multilayer-perceptrons/kaggle-house-price.html#id2 (run the download in a Jupyter notebook)

Step 2. Load the dataset (adjust the relative paths to wherever you saved the files!)

import numpy as np 
import pandas as pd  
import torch 
from torch import nn 
from d2l import torch as d2l   

train_data = pd.read_csv("../data/kaggle_house_pred_train.csv") 
test_data = pd.read_csv("../data/kaggle_house_pred_test.csv")  
train_data.shape , test_data.shape # dataset sizes: ((1460, 81), (1459, 80))
# Peek at the data to plan the preprocessing
train_data
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal

Step 3. Preprocess the data: drop useless columns, standardize the numeric features, and fill missing values with 0

# Sale prices vary over a wide range, so we standardize the numeric features, just like normalizing to a standard normal distribution
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:])) 

Here concat stacks the training and test features. The first column (Id) of both frames is dropped because it is just a row index and carries no signal for training; the last column of the training data (SalePrice) is dropped because it is the label, which the test data does not have, so dropping it aligns the two frames. Train (1460, 81) + test (1459, 80) --> (2919, 79): note it is the first and last *columns* that are removed, not rows.
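The column slicing described above can be sketched on a toy frame (the column names `F1`, `F2` here are made up for illustration, not from the dataset):

```python
import pandas as pd

# Toy stand-ins for the real train/test frames: an Id column,
# two feature columns, and (train only) a SalePrice label column.
train = pd.DataFrame({"Id": [1, 2], "F1": [10, 20], "F2": [3, 4],
                      "SalePrice": [100, 200]})
test = pd.DataFrame({"Id": [3], "F1": [30], "F2": [5]})

# Drop Id everywhere, drop SalePrice from train, then stack the rows.
all_feats = pd.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))
print(all_feats.shape)  # (3, 2): 2 + 1 rows, feature columns only
```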

# If the test data were unavailable, the mean and standard deviation could be computed from the training data alone
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index # index of the numeric columns, i.e. 'MSSubClass', 'LotFrontage', ...
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))

# After standardization every numeric feature has mean 0, so filling missing values with 0 fills them with the mean
all_features[numeric_features] = all_features[numeric_features].fillna(0)

# dummy_na=True treats "na" (missing) as a valid category and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True) 
all_features.shape # (2919, 331) 
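The three preprocessing steps can be seen end to end on a tiny frame (the columns `area` and `zone` are invented for this sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"area": [100.0, 200.0, np.nan],
                   "zone": ["RL", "RM", np.nan]})

# 1. Standardize the numeric columns (NaN is skipped by mean/std).
num = df.dtypes[df.dtypes != "object"].index
df[num] = df[num].apply(lambda x: (x - x.mean()) / x.std())

# 2. The mean is now 0, so 0 is a neutral fill for missing numeric values.
df[num] = df[num].fillna(0)

# 3. One-hot encode categoricals, with an extra indicator column for NaN.
df = pd.get_dummies(df, dummy_na=True)
print(df.columns.tolist())  # ['area', 'zone_RL', 'zone_RM', 'zone_nan']
```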

Step 4. Convert the cleaned data to torch tensors

# Extract NumPy arrays from the pandas frames and convert them to tensors for training. 
n_train= train_data.shape[0] 
train_features = torch.tensor(all_features[:n_train].values,dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(
    train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32) # i.e. the last column of train_data
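A minimal sketch of this pandas → NumPy → tensor conversion, with made-up numbers:

```python
import pandas as pd
import torch

df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
t = torch.tensor(df.values, dtype=torch.float32)          # (2, 2) feature tensor
labels = torch.tensor(df.a.values.reshape(-1, 1),
                      dtype=torch.float32)                # (2, 1) column vector
print(t.shape, labels.shape)
```

One caveat (depends on your pandas version): newer pandas releases make `get_dummies` emit boolean columns, so on the real `all_features` you may need something like `all_features.values.astype('float32')` before `torch.tensor` if the direct conversion complains about mixed dtypes.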

Step 5. Define the loss function, K-fold cross-validation, the training function, ...

loss = nn.MSELoss() 
in_features = train_features.shape[1] # preprocessing changed the column count, so recompute it here
def get_net():
    net = nn.Sequential(nn.Linear(in_features,1))
    return net
# For house prices we care about relative error (y - y_hat)/y, not absolute error y - y_hat
def log_rmse(net, features, labels):
    # To further stabilize the value when taking the log, clip values below 1 up to 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds),
                           torch.log(labels)))
    return rmse.item()
# Define the training function 
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], [] 
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # Adam optimizer: similar to SGD, but less sensitive to the choice of learning rate
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr = learning_rate,
                                 weight_decay = weight_decay) 
    for epoch in range(num_epochs): 
        for X,y in train_iter: 
            optimizer.zero_grad()
            l=loss(net(X),y) 
            l.backward()  
            optimizer.step()  
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
# Define K-fold cross-validation
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k  # integer division --> size of each fold, fold_size
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size) # built-in slice object (positional args only)
        X_part, y_part = X[idx, :], y[idx] # see the Q&A below for details on slice()
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)  # stack the folds vertically 
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                     legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k
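The fold bookkeeping above can be sanity-checked with the equivalent slicing written out directly (toy tensors, not the author's helper):

```python
import torch

X = torch.arange(20.).reshape(10, 2)   # 10 samples, 2 features
y = torch.arange(10.).reshape(-1, 1)

k, i = 5, 2                            # validate fold i = 2 of k = 5
fold_size = X.shape[0] // k            # 2 samples per fold
idx = slice(i * fold_size, (i + 1) * fold_size)

X_valid, y_valid = X[idx], y[idx]      # the held-out fold
# everything before and after the held-out fold becomes training data
X_train = torch.cat([X[:i * fold_size], X[(i + 1) * fold_size:]], 0)
print(X_valid.shape, X_train.shape)    # 2 validation rows, 8 training rows
```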

Step 6. Model selection

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')

fold 1, train log rmse 0.170172, valid log rmse 0.157228
fold 2, train log rmse 0.162814, valid log rmse 0.191097
fold 3, train log rmse 0.163801, valid log rmse 0.168365
fold 4, train log rmse 0.168104, valid log rmse 0.154445
fold 5, train log rmse 0.162938, valid log rmse 0.182995
5-fold validation: avg train log rmse: 0.165566, avg valid log rmse: 0.170826

Q&A

Q1: How does pd.concat work? It concatenates data.

  • objs: a list of Series, DataFrame, or Panel objects
  • axis: the axis to concatenate along; 0 = rows, 1 = columns
  • join: the join mode, 'inner' or 'outer'
import pandas as pd  
import numpy as np  

# Build some sample data
df1=pd.DataFrame([['A11','A12','A13','A14'],['A21','A22','A23','A24'],['A31','A32','A33','A34'],['A41','A42','A43','A44']],columns=list('ABCD'))
df2=pd.DataFrame([['B11','B12','B13','B14'],['B21','B22','B23','B24'],['B31','B32','B33','B34'],['B41','B42','B43','B44']],columns=list('ABCD'))
df3=pd.DataFrame([['C11','C12','C13','C14'],['C21','C22','C23','C24'],['C31','C32','C33','C34'],['C41','C42','C43','C44']],columns=list('ABCD'))
df4=pd.DataFrame([['D11','D12','D13','D14'],['D21','D22','D23','D24'],['D31','D32','D33','D34']],columns=list('ABCD'))

frames = [df1,df2,df3]

# The default is vertical stacking (axis=0); here we join the frames side by side with axis=1 
pd.concat(objs= frames,axis= 1)
A B C D A B C D A B C D
0 A11 A12 A13 A14 B11 B12 B13 B14 C11 C12 C13 C14
1 A21 A22 A23 A24 B21 B22 B23 B24 C21 C22 C23 C24
2 A31 A32 A33 A34 B31 B32 B33 B34 C31 C32 C33 C34
3 A41 A42 A43 A44 B41 B42 B43 B44 C41 C42 C43 C44
# When the row or column counts differ, the missing cells in the result are filled with NaN 
pd.concat([df1,df4],axis=1)
A B C D A B C D
0 A11 A12 A13 A14 D11 D12 D13 D14
1 A21 A22 A23 A24 D21 D22 D23 D24
2 A31 A32 A33 A34 D31 D32 D33 D34
3 A41 A42 A43 A44 NaN NaN NaN NaN
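The axis=0 vs axis=1 distinction can be confirmed with a minimal pair of frames (toy data):

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

rows = pd.concat([df1, df2], axis=0)  # stack vertically:   4 rows x 2 cols
cols = pd.concat([df1, df2], axis=1)  # join side by side:  2 rows x 4 cols
print(rows.shape, cols.shape)  # (4, 2) (2, 4)
```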

Q2: How does torch.clamp() work? It restricts values to an interval: values below min become min, and values above max become max.

It clamps every element of the input tensor to the interval [min, max] and returns the result as a new tensor.
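A quick demonstration, including the one-sided clipping used in log_rmse:

```python
import torch

x = torch.tensor([-2.0, 0.5, 3.0])
print(torch.clamp(x, 0.0, 1.0))         # tensor([0.0000, 0.5000, 1.0000])
# Only a lower bound, as in log_rmse: values below 1 become 1
print(torch.clamp(x, 1, float('inf')))  # tensor([1.0000, 1.0000, 3.0000])
```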

Q3: What does the slice() function do? slice() creates a slice object, mainly used to pass slicing parameters into indexing operations.
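Note that slice() takes only positional arguments, slice(start, stop[, step]); passing keywords like start= raises a TypeError. A short sketch:

```python
lst = list(range(10))

idx = slice(2, 5)                  # equivalent to the literal 2:5
print(lst[idx])                    # [2, 3, 4] -- same as lst[2:5]
print(lst[slice(None, None, 2)])   # [0, 2, 4, 6, 8] -- same as lst[::2]
```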

Q4: I don't understand the custom log_rmse function.

For house prices we should not use the absolute error; we should use the relative error \(\frac{\hat{y}}{y}\). To keep the computation simple we work with logarithms: \(|\log y - \log \hat{y}| \leq \delta\) is equivalent to \(e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta\). This leads to the following root-mean-squared error between the logarithm of the predicted price and the logarithm of the true price:

\[\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2} \]

Our function takes the logarithm before computing the RMSE, so the predicted prices must first be clipped into the interval [1, ∞); that is exactly what torch.clamp() does.
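The metric can be computed by hand on two toy predictions (numbers invented, not from the dataset) to see that it measures relative error:

```python
import torch
from torch import nn

loss = nn.MSELoss()
preds = torch.tensor([[100000.0], [210000.0]])   # model outputs
labels = torch.tensor([[110000.0], [200000.0]])  # true prices

# Same steps as log_rmse: clip, take logs, MSE, square root
clipped = torch.clamp(preds, 1, float('inf'))
rmse = torch.sqrt(loss(torch.log(clipped), torch.log(labels)))
print(rmse.item())  # ~0.076: both predictions are off by roughly 5-10% relatively
```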

posted on 2022-01-15 23:44 by YangShusen