
Predicting California housing prices with Scikit-Learn: a walkthrough of the full machine-learning workflow (with detailed notes)

 

Chapter1_housing_price_predict

 

 

 

I. Import the required libraries

In [6]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import tarfile
from six.moves import urllib
 

II. Write fetch_housing_data() to download the California housing data and obtain housing.csv

When fetch_housing_data() is called, it creates a datasets/housing directory in the workspace, downloads housing.tgz, and extracts it.

In [7]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
fetch_housing_data()
 

III. Write load_housing_data() to read housing.csv and load the California housing data

In [8]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing=load_housing_data()
 

IV. Split the data with Scikit-Learn into a training set and a test set

 

(1) Using train_test_split. Commonly used parameters (see the sklearn documentation for details):

test_size: float, int or None, optional (default=0.25)

random_state: seed for the random number generator, so that running this code several times produces exactly the same split; conventionally set to 42

shuffle: boolean, default True; when True the data are shuffled (randomly reordered) before splitting

stratify: default None. It may only be non-None when shuffle=True; if given, the data are split in a stratified fashion using this array as the class labels. (A short example of stratify follows the code cell below.)

In [9]:
from sklearn.model_selection import train_test_split
train_set,test_set=train_test_split(housing,test_size=0.2,random_state=42)
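
A minimal sketch (not in the original notebook) of the stratify parameter described above: the same split, but stratified on the existing categorical column ocean_proximity, chosen here purely for illustration.

from sklearn.model_selection import train_test_split

# Sketch: stratified split via train_test_split itself; ocean_proximity is
# used only as an example of a categorical column to stratify on.
s_train, s_test = train_test_split(
    housing,
    test_size=0.2,
    random_state=42,
    shuffle=True,                         # stratify requires shuffle=True
    stratify=housing["ocean_proximity"],  # array used as class labels
)

# Class proportions are preserved between the full data and the test split
print(housing["ocean_proximity"].value_counts(normalize=True))
print(s_test["ocean_proximity"].value_counts(normalize=True))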
 

(2) Using StratifiedShuffleSplit. Notes on StratifiedShuffleSplit (see the sklearn documentation for details):

This class is mainly used for cross-validation (see later) and performs stratified splits.

Every split it produces keeps the same class proportions as the original dataset: if the first training split has a class ratio of 2:1, every later split keeps that ratio as well. Parameters:

n_splits is the number of train/test pairs to generate; set it as needed, default 10.

(A stratified split preserves the proportion of each class from the original dataset. For example, if the original data contain two classes A and B with A:B = 5:2, then after splitting both the training set and the test set also have A:B = 5:2. Stratified sampling avoids introducing significant sampling bias.)

 

To do a stratified split, the data first need categories. Assuming that median income is a very important attribute for predicting the median house value, we create income categories from it.

First look at the distribution of the original median income:

In [11]:
housing["median_income"].hist()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd9cbd7f60>
 
 

Then we process the median income:

(1) Divide each median income by 1.5 (to limit the number of income categories) and round up with ceil (to produce discrete categories).

(2) Merge every income category above 5 into category 5; smaller values keep their rounded value as the category (1, 2, 3, 4). For the use of where see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.where.html (pay particular attention to the other parameter: entries that do not satisfy the cond condition are replaced with the value of other, here 5.0).

In [16]:
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# print (housing["income_cat"])
# print (type(housing["income_cat"]))
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
# print (housing["income_cat"])
 

The distribution of the income categories after this processing:

In [15]:
housing["income_cat"].hist()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd895ff048>
 
In [21]:
from sklearn.model_selection import StratifiedShuffleSplit
ss=StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index,test_index in ss.split(housing,housing["income_cat"]):
    strat_train_set=housing.loc[train_index]
    strat_test_set=housing.loc[test_index]
 

Verify that the class proportions in the split data match those of the original data:

In [22]:
strat_test_set["income_cat"].value_counts()/len(strat_test_set)
Out[22]:
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cat, dtype: float64
In [23]:
housing["income_cat"].value_counts()/len(housing)
Out[23]:
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
 

Finally, after stratified splitting on the income categories, drop the income_cat attribute so the data are back in their original state:

In [24]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)
 

V. Data preparation (from here on we work only with the training set obtained from the split)

(1) Separate the data into the feature matrix X and the labels y (the median_house_value column):

In [76]:
housing=strat_train_set.drop("median_house_value",axis=1)
housing_labels=strat_train_set["median_house_value"].copy()
 

(2) Data cleaning

1. Features with missing values need to be handled. The options are:

(1) drop the corresponding rows

(2) drop the whole column for that attribute

(3) replace the missing values with some value (0, the mean, the median)

(A quick pandas sketch of all three options follows this list.)
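
A rough pandas sketch of the three options above (for illustration only; none of these calls modify housing in place, and total_bedrooms is used because it turns out below to be the column with missing values):

# Option (1): drop the rows that have a missing total_bedrooms value
housing.dropna(subset=["total_bedrooms"])

# Option (2): drop the whole total_bedrooms column
housing.drop("total_bedrooms", axis=1)

# Option (3): fill missing values with some value, here the column median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)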

 

Check each attribute for missing values:

In [26]:
housing.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 9 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16354 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
ocean_proximity       16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.3+ MB
 

total_bedrooms has missing values and needs handling; here we fill them with the median of total_bedrooms.

In [33]:
from sklearn.impute import SimpleImputer
simputer=SimpleImputer(strategy="median")
 

Since the median can only be computed on numerical attributes, we create a copy of the data without the text attribute ocean_proximity:

In [49]:
housing_num_copy=housing.drop("ocean_proximity",axis=1)
 

Now fit the simputer instance to the training data with the fit() method:

In [51]:
simputer.fit(housing_num_copy)
Out[51]:
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)
 

simputer has computed the median of every attribute and stored the results in its statistics_ attribute. Only total_bedrooms has missing values right now, but we cannot be sure that new data will not have missing values in other attributes, so it is safer to apply the imputer to all numerical attributes.

In [36]:
simputer.statistics_
Out[36]:
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])
 

Now the "trained" simputer can be used to transform the training set, replacing missing values with the medians:

In [52]:
X=simputer.transform(housing_num_copy)
 

The result X is a plain NumPy array containing the transformed features; convert it back into a DataFrame:

In [53]:
housing_tr = pd.DataFrame(X, columns=housing_num_copy.columns,index = list(housing.index.values))
 

2. Convert text data into numbers, mainly using sklearn's OneHotEncoder.

The ocean_proximity attribute is text, so it needs processing:

In [45]:
housing_ocean_pro=housing[["ocean_proximity"]]
housing_ocean_pro.head(10)
Out[45]:
 
 ocean_proximity
17606 <1H OCEAN
18632 <1H OCEAN
14650 NEAR OCEAN
3230 INLAND
3555 <1H OCEAN
19480 INLAND
8879 <1H OCEAN
13685 INLAND
4937 <1H OCEAN
4861 <1H OCEAN
 

For an introduction to one-hot encoding, see this note (https://www.imooc.com/article/35900).

sklearn's OneHotEncoder implements one-hot encoding:

In [46]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder=OneHotEncoder()
housing_ocean_pro_1hot=cat_encoder.fit_transform(housing_ocean_pro)
housing_ocean_pro_1hot
Out[46]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
 

The resulting housing_ocean_pro_1hot is a sparse matrix; convert it to a NumPy array:

In [47]:
housing_ocean_pro_1hot.toarray()
Out[47]:
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
 

Use the encoder to see which values the ocean_proximity feature takes:

In [48]:
cat_encoder.categories_
Out[48]:
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
 

(3) Adding feature columns

Here we implement a transformer class by hand. Scikit-Learn relies on duck typing (a style of dynamic typing in which an object's validity comes not from inheriting a particular class or implementing a particular interface, but from the set of methods and attributes it currently has; what matters is not the object's type itself but how it is used).

In [59]:
from sklearn.base import BaseEstimator,TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): 
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
 

(4) Feature scaling

With few exceptions, machine learning algorithms do not perform well when the numerical input attributes have very different scales.

There are two common ways to bring all attributes to the same scale: min-max scaling and standardization.

1. Normalization (min-max scaling)

Attribute values are rescaled so they end up in the range [0, 1].

This is done by subtracting the minimum value and dividing by the difference between the maximum and the minimum.

Scikit-Learn provides the MinMaxScaler transformer for this. If you do not want the range to be 0 to 1, its feature_range hyperparameter lets you change it.

2. Standardization

Each feature is shifted and rescaled so that its mean becomes 0 (by subtracting the feature's mean from every value) and its standard deviation becomes 1.

For each feature, first subtract the feature's mean from every value (so standardized values always have zero mean), then divide by the standard deviation so that the resulting distribution has unit variance.

Scikit-Learn provides the StandardScaler transformer for standardization.

3. Normalization vs. standardization:

Differences: normalization maps feature values onto a common scale in [0, 1]. Standardization works column by column on the feature matrix, converting each feature to a z-score relative to the overall sample distribution, so every sample point influences the result. Standardization does not bound values to a specific range and is much less affected by outliers.

Similarities: both remove errors caused by differing units; both are linear transformations that scale the vector X and then shift it.
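
A minimal sketch of the two scalers (assumption: applied to housing_num_copy from above, with missing values filled first so the example is self-contained; in the actual workflow the scaler sits inside the pipeline built below):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_data = housing_num_copy.fillna(housing_num_copy.median())  # fill NaNs for this demo

# Min-max scaling: each feature is mapped into feature_range (default (0, 1))
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
housing_minmax = min_max_scaler.fit_transform(num_data)

# Standardization: each feature ends up with zero mean and unit variance
std_scaler = StandardScaler()
housing_std = std_scaler.fit_transform(num_data)

print(housing_minmax.min(axis=0), housing_minmax.max(axis=0))                # 0s and 1s
print(housing_std.mean(axis=0).round(2), housing_std.std(axis=0).round(2))   # 0s and 1s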

 

(5) Building a pipeline with sklearn's Pipeline

 

Build the lists of numerical feature names and of text/categorical feature names:

In [64]:
num_attribs=list(housing_num_copy)
cat_attribs=["ocean_proximity"]
# print(num_attribs)
# print(cat_attribs)
 
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
['ocean_proximity']
 

1. Create the pipeline for the numerical features

In [114]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_type_pipeline=Pipeline([
    ('simputer',SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler',StandardScaler())
    
])
 

2. Combine the pipelines with sklearn's ColumnTransformer

For ColumnTransformer, see (https://www.codercto.com/a/31047.html)

In [115]:
from sklearn.compose import ColumnTransformer
full_pipeline = ColumnTransformer([
        ("num", num_type_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
 

3. Run the full pipeline to apply all the transformations to the data

In [116]:
housing_prepared=full_pipeline.fit_transform(housing)
housing_prepared
Out[116]:
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
 

VI. Select and train a model

(1) Linear regression

In [77]:
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)
Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
 

Take a few training instances to check the model:

In [82]:
some_data=housing.iloc[:5]
some_labels=housing_labels.iloc[:5]
some_data_prepared=full_pipeline.transform(some_data)
print (lin_reg.predict(some_data_prepared))
print(list(some_labels))
 
[210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
[286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
 

Comparing the predictions above with the true values shows large errors.

Next, look at the RMSE of linear regression on the whole training set:

In [83]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
 
68628.19819848922
 

The RMSE between the predictions and the actual house values is 68628: the model is underfitting.

The main ways to fix underfitting are to select a more powerful model, to feed the training algorithm better features, or to reduce the constraints on the model.

(2) Decision tree regression

In [84]:
from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepared,housing_labels)
Out[84]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
 

1. Evaluate the model on the training set

In [86]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)
 
0.0
 

This result does not mean the model is good; on the contrary, it indicates severe overfitting.

 

2. Use cross-validation for a better evaluation

For cross-validation, see (https://www.cnblogs.com/sddai/p/5696834.html)

Pay particular attention to the scoring parameter of cross_val_score; see (https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

Here we use K-fold cross-validation: it randomly splits the training set into ten distinct subsets called folds, then trains and evaluates the decision tree model 10 times, each time using a different fold for evaluation and the other 9 for training. The result is an array of 10 scores:

In [96]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
# print(scores)
tree_rmse_scores = np.sqrt(-scores)
In [90]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
In [93]:
display_scores(tree_rmse_scores)
 
Scores: [69074.07529867 67212.26643168 71226.93340782 69011.53460089
 70760.95156189 74783.59188961 69079.5355068  71798.81224067
 75546.22137756 69397.60073089]
Mean: 70789.15230464848
Standard deviation: 2520.949152366461
 

Run the same K-fold cross-validation for linear regression and compare the two:

In [94]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
 
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.674001798348
 

Cross-validation gives not only an estimate of the model's performance but also a measure of how precise that estimate is (its standard deviation).

The comparison shows that the decision tree model overfits badly; it performs even worse than linear regression.

 

(3) Random forest regression

In [97]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
Out[97]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)
In [100]:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
Out[100]:
21933.31414779769
 

Evaluate the model with K-fold cross-validation:

In [98]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
 
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122
 

The RMSE on the training set (21933) is still much lower than on the validation sets (about 52583), which means the model is still overfitting the training set.

 

VII. Fine-tuning the model

(1) Grid search

Use Scikit-Learn's GridSearchCV to do this search. Its purpose is automatic hyperparameter tuning: feed it the parameters and it returns the best-scoring combination.

All you need to do is tell GridSearchCV which hyperparameters to try and which values to try; it then uses cross-validation to evaluate every possible combination of hyperparameter values.

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')

Parameters (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

estimator: the estimator to use, with all parameters other than the ones being tuned already set. Every estimator needs a scoring parameter or a score method.

param_grid: the parameter values to search over, given as a dict or a list of dicts

cv: cross-validation parameter; None (the default) means 3-fold cross-validation in the sklearn version used here. You can also pass the number of folds or a generator yielding train/test splits.

For example, the following code searches for the best combination of RandomForestRegressor hyperparameter values:

In [101]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)
Out[101]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [3, 10, 30]}, {'n_estimators': [3, 10], 'max_features': [2, 3, 4], 'bootstrap': [False]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)
 

param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of the n_estimators and max_features values listed in the first dict (do not worry about what these hyperparameters mean; they are explained later with random forests), then to try the 2 × 3 = 6 combinations in the second dict, this time with the bootstrap hyperparameter set to False instead of True (its default).

In total, the grid search explores 12 + 6 = 18 combinations of RandomForestRegressor hyperparameters and trains each model five times (since we use 5-fold cross-validation). In other words, there are 18 × 5 = 90 training rounds in all! The search takes quite a while, and when it is done you get the best combination of parameters, as shown below:

In [102]:
grid_search.best_params_
Out[102]:
{'max_features': 8, 'n_estimators': 30}
 

Look at the best estimator:

In [104]:
grid_search.best_estimator_
Out[104]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)
 

Look at the evaluation score of every parameter combination:

In [106]:
cvres=grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
 
64835.28497462568 {'max_features': 2, 'n_estimators': 3}
55992.209032344894 {'max_features': 2, 'n_estimators': 10}
53196.79504781103 {'max_features': 2, 'n_estimators': 30}
60216.336137120685 {'max_features': 4, 'n_estimators': 3}
53353.71606249172 {'max_features': 4, 'n_estimators': 10}
50876.67389111256 {'max_features': 4, 'n_estimators': 30}
59545.52190035615 {'max_features': 6, 'n_estimators': 3}
52527.66685438906 {'max_features': 6, 'n_estimators': 10}
50139.12484396652 {'max_features': 6, 'n_estimators': 30}
59067.08851253479 {'max_features': 8, 'n_estimators': 3}
51813.87695997916 {'max_features': 8, 'n_estimators': 10}
49987.25641246688 {'max_features': 8, 'n_estimators': 30}
62541.338087303535 {'n_estimators': 3, 'max_features': 2, 'bootstrap': False}
54836.498222902934 {'n_estimators': 10, 'max_features': 2, 'bootstrap': False}
60487.55001142947 {'n_estimators': 3, 'max_features': 3, 'bootstrap': False}
53044.81804206819 {'n_estimators': 10, 'max_features': 3, 'bootstrap': False}
57875.95997175016 {'n_estimators': 3, 'max_features': 4, 'bootstrap': False}
51840.92942525009 {'n_estimators': 10, 'max_features': 4, 'bootstrap': False}
 

The best solution is obtained with max_features set to 8 and n_estimators set to 30. For this combination the RMSE is 49987, slightly better than the score obtained earlier with the default hyperparameter values (52583).

 

(2) Randomized search

When the hyperparameter search space is large, it is better to use RandomizedSearchCV. It is used much like GridSearchCV, but instead of trying every possible combination it evaluates a given number of random combinations, sampling a random value for each hyperparameter at every iteration. This has two main advantages:

1) If you let the randomized search run for, say, 1,000 iterations, it explores 1,000 different values for each hyperparameter (instead of just the few values per hyperparameter that a grid search tries).

2) You can control the computational budget of the hyperparameter search simply by setting the number of iterations.

In [107]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
Out[107]:
RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fbd87196710>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fbd871965f8>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)
In [109]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
 
49150.657232934034 {'max_features': 7, 'n_estimators': 180}
51389.85295710133 {'max_features': 5, 'n_estimators': 15}
50796.12045980556 {'max_features': 3, 'n_estimators': 72}
50835.09932039744 {'max_features': 5, 'n_estimators': 21}
49280.90117886215 {'max_features': 7, 'n_estimators': 122}
50774.86679035961 {'max_features': 3, 'n_estimators': 75}
50682.75001237282 {'max_features': 3, 'n_estimators': 88}
49608.94061293652 {'max_features': 5, 'n_estimators': 100}
50473.57642831875 {'max_features': 3, 'n_estimators': 150}
64429.763804893395 {'max_features': 5, 'n_estimators': 2}
 

Look at the relative importance of each feature:

In [108]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[108]:
array([7.04224175e-02, 6.02124940e-02, 4.59296052e-02, 1.55428191e-02,
       1.50341798e-02, 1.60330990e-02, 1.48669649e-02, 3.80234348e-01,
       4.54001758e-02, 1.11548704e-01, 6.46238352e-02, 1.10748235e-02,
       1.42318203e-01, 3.89612797e-05, 1.76021449e-03, 4.95915437e-03])
 

Pair the importance scores with the attribute names:

In [117]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"] 
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Out[117]:
[(0.38023434841388537, 'median_income'),
 (0.14231820319691774, 'INLAND'),
 (0.11154870415426024, 'pop_per_hhold'),
 (0.07042241751016282, 'longitude'),
 (0.06462383516760929, 'bedrooms_per_room'),
 (0.060212493972978696, 'latitude'),
 (0.04592960524268814, 'housing_median_age'),
 (0.04540017578326126, 'rooms_per_hhold'),
 (0.0160330990414146, 'population'),
 (0.015542819135579315, 'total_rooms'),
 (0.015034179787027079, 'total_bedrooms'),
 (0.014866964945789509, 'households'),
 (0.011074823508549598, '<1H OCEAN'),
 (0.004959154365657991, 'NEAR OCEAN'),
 (0.0017602144944791275, 'NEAR BAY'),
 (3.896127973928007e-05, 'ISLAND')]
 

With this information you can drop some of the less useful features (for example, apparently only one ocean_proximity category, INLAND, is really useful, so the others could be dropped). A rough sketch of keeping only the top features follows.
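
A rough sketch (not in the original notebook) of keeping only the k most important features, using feature_importances_, attributes, and housing_prepared defined above; k = 8 is an arbitrary choice for illustration:

k = 8                                                     # arbitrary number of features to keep
top_k_idx = np.argsort(feature_importances)[-k:]          # indices of the k largest importances
print(sorted(attributes[i] for i in top_k_idx))           # names of the surviving features

housing_prepared_top_k = housing_prepared[:, top_k_idx]   # reduced training matrix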

 

VIII. Evaluate the model on the test set

In [118]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
In [119]:
final_rmse 
Out[119]:
47997.74457886712
 

To sum up: the steps above show what a machine-learning project looks like and how the workflow fits together. Data preparation is a crucial part of it, and the features we train on largely determine the final performance of the model. I hope to keep working through Hands-On Machine Learning with Scikit-Learn and TensorFlow to study machine learning and deep learning, checking the official documentation for every class and function used along the way. Keep it up!

posted on 2019-01-19 17:05 by Laurel1115