
Predicting California housing prices with Scikit-Learn: a walkthrough of the full machine-learning workflow (with detailed notes)

 

Chapter1_housing_price_predict

 

 

 

I. Import the required libraries

In [6]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import tarfile
from six.moves import urllib
 

II. Write fetch_housing_data() to download the California housing data and obtain housing.csv

When fetch_housing_data() is called, it creates a datasets/housing directory in the workspace, downloads housing.tgz, and extracts it.

In [7]:
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
fetch_housing_data()
 

III. Write load_housing_data() to read housing.csv and load the California housing data

In [8]:
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing=load_housing_data()
 

IV. Split the data with Scikit-Learn into a training set and a test set

 

(1) Using train_test_split. Commonly used parameters (see the sklearn documentation for details):

test_size: float, int or None, optional (default=0.25)

random_state: seed for the random number generator, so that running this code several times produces exactly the same split; conventionally set to 42

shuffle: boolean, default True; when True the data are shuffled (randomly reordered) before splitting

stratify: default None. It may only be non-None when shuffle=True; if given, the data are split in a stratified fashion using this array as the class labels. (A short example of stratify follows the code cell below.)

In [9]:
from sklearn.model_selection import train_test_split
train_set,test_set=train_test_split(housing,test_size=0.2,random_state=42)
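
A minimal sketch (not in the original notebook) of the stratify parameter described above: the same split, but stratified on the existing categorical column ocean_proximity, chosen here purely for illustration.

from sklearn.model_selection import train_test_split

# Sketch: stratified split via train_test_split itself; ocean_proximity is
# used only as an example of a categorical column to stratify on.
s_train, s_test = train_test_split(
    housing,
    test_size=0.2,
    random_state=42,
    shuffle=True,                         # stratify requires shuffle=True
    stratify=housing["ocean_proximity"],  # array used as class labels
)

# Class proportions are preserved between the full data and the test split
print(housing["ocean_proximity"].value_counts(normalize=True))
print(s_test["ocean_proximity"].value_counts(normalize=True))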
 

(2) Using StratifiedShuffleSplit. Notes on StratifiedShuffleSplit (see the sklearn documentation for details):

This class is mainly used for cross-validation (see later) and performs stratified splits.

Every split it produces keeps the same class proportions as the original dataset: if the first training split has a class ratio of 2:1, every later split keeps that ratio as well. Parameters:

n_splits is the number of train/test pairs to generate; set it as needed, default 10.

(A stratified split preserves the proportion of each class from the original dataset. For example, if the original data contain two classes A and B with A:B = 5:2, then after splitting both the training set and the test set also have A:B = 5:2. Stratified sampling avoids introducing significant sampling bias.)

 

To do a stratified split, the data first need categories. Assuming that median income is a very important attribute for predicting the median house value, we create income categories from it.

First look at the distribution of the original median income:

In [11]:
housing["median_income"].hist()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd9cbd7f60>
 
 

Then we process the median income:

(1) Divide each median income by 1.5 (to limit the number of income categories) and round up with ceil (to produce discrete categories).

(2) Merge every income category above 5 into category 5; smaller values keep their rounded value as the category (1, 2, 3, 4). For the use of where see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.where.html (pay particular attention to the other parameter: entries that do not satisfy the cond condition are replaced with the value of other, here 5.0).

In [16]:
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# print (housing["income_cat"])
# print (type(housing["income_cat"]))
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
# print (housing["income_cat"])
 

The distribution of the income categories after this processing:

In [15]:
housing["income_cat"].hist()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbd895ff048>
 
In [21]:
from sklearn.model_selection import StratifiedShuffleSplit
ss=StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index,test_index in ss.split(housing,housing["income_cat"]):
    strat_train_set=housing.loc[train_index]
    strat_test_set=housing.loc[test_index]
 

Verify that the class proportions in the split data match those of the original data:

In [22]:
strat_test_set["income_cat"].value_counts()/len(strat_test_set)
Out[22]:
3.0    0.350533
2.0    0.318798
4.0    0.176357
5.0    0.114583
1.0    0.039729
Name: income_cat, dtype: float64
In [23]:
housing["income_cat"].value_counts()/len(housing)
Out[23]:
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
 

Finally, after stratified splitting on the income categories, drop the income_cat attribute so the data are back in their original state:

In [24]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)
 

V. Data preparation (from here on we work only with the training set obtained from the split)

(1) Separate the data into the feature matrix X and the labels y (the median_house_value column):

In [76]:
housing=strat_train_set.drop("median_house_value",axis=1)
housing_labels=strat_train_set["median_house_value"].copy()
 

(2) Data cleaning

1. Features with missing values need to be handled. The options are:

(1) drop the corresponding rows

(2) drop the whole column for that attribute

(3) replace the missing values with some value (0, the mean, the median)

(A quick pandas sketch of all three options follows this list.)
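
A rough pandas sketch of the three options above (for illustration only; none of these calls modify housing in place, and total_bedrooms is used because it turns out below to be the column with missing values):

# Option (1): drop the rows that have a missing total_bedrooms value
housing.dropna(subset=["total_bedrooms"])

# Option (2): drop the whole total_bedrooms column
housing.drop("total_bedrooms", axis=1)

# Option (3): fill missing values with some value, here the column median
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)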

 

Check each attribute for missing values:

In [26]:
housing.info()
 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 17606 to 15775
Data columns (total 9 columns):
longitude             16512 non-null float64
latitude              16512 non-null float64
housing_median_age    16512 non-null float64
total_rooms           16512 non-null float64
total_bedrooms        16354 non-null float64
population            16512 non-null float64
households            16512 non-null float64
median_income         16512 non-null float64
ocean_proximity       16512 non-null object
dtypes: float64(8), object(1)
memory usage: 1.3+ MB
 

total_bedrooms has missing values and needs handling; here we fill them with the median of total_bedrooms.

In [33]:
from sklearn.impute import SimpleImputer
simputer=SimpleImputer(strategy="median")
 

Since the median can only be computed on numerical attributes, we create a copy of the data without the text attribute ocean_proximity:

In [49]:
housing_num_copy=housing.drop("ocean_proximity",axis=1)
 

Now fit the simputer instance to the training data with the fit() method:

In [51]:
simputer.fit(housing_num_copy)
Out[51]:
SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='median', verbose=0)
 

simputer has computed the median of every attribute and stored the results in its statistics_ attribute. Only total_bedrooms has missing values right now, but we cannot be sure that new data will not have missing values in other attributes, so it is safer to apply the imputer to all numerical attributes.

In [36]:
simputer.statistics_
Out[36]:
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])
 

Now the "trained" simputer can be used to transform the training set, replacing missing values with the medians:

In [52]:
X=simputer.transform(housing_num_copy)
 

The result X is a plain NumPy array containing the transformed features; convert it back into a DataFrame:

In [53]:
housing_tr = pd.DataFrame(X, columns=housing_num_copy.columns,index = list(housing.index.values))
 

2. Convert text data into numbers, mainly using sklearn's OneHotEncoder.

The ocean_proximity attribute is text, so it needs processing:

In [45]:
housing_ocean_pro=housing[["ocean_proximity"]]
housing_ocean_pro.head(10)
Out[45]:
 
 ocean_proximity
17606 <1H OCEAN
18632 <1H OCEAN
14650 NEAR OCEAN
3230 INLAND
3555 <1H OCEAN
19480 INLAND
8879 <1H OCEAN
13685 INLAND
4937 <1H OCEAN
4861 <1H OCEAN
 

For an introduction to one-hot encoding, see this note (https://www.imooc.com/article/35900).

sklearn's OneHotEncoder implements one-hot encoding:

In [46]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder=OneHotEncoder()
housing_ocean_pro_1hot=cat_encoder.fit_transform(housing_ocean_pro)
housing_ocean_pro_1hot
Out[46]:
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
 

The resulting housing_ocean_pro_1hot is a sparse matrix; convert it to a NumPy array:

In [47]:
housing_ocean_pro_1hot.toarray()
Out[47]:
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
 

Use the encoder to see which values the ocean_proximity feature takes:

In [48]:
cat_encoder.categories_
Out[48]:
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
 

(3) Adding feature columns

Here we implement a transformer class by hand. Scikit-Learn relies on duck typing (a style of dynamic typing in which an object's validity comes not from inheriting a particular class or implementing a particular interface, but from the set of methods and attributes it currently has; what matters is not the object's type itself but how it is used).

In [59]:
from sklearn.base import BaseEstimator,TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): 
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
 

(4) Feature scaling

With few exceptions, machine learning algorithms do not perform well when the numerical input attributes have very different scales.

There are two common ways to bring all attributes to the same scale: min-max scaling and standardization.

1. Normalization (min-max scaling)

Attribute values are rescaled so they end up in the range [0, 1].

This is done by subtracting the minimum value and dividing by the difference between the maximum and the minimum.

Scikit-Learn provides the MinMaxScaler transformer for this. If you do not want the range to be 0 to 1, its feature_range hyperparameter lets you change it.

2. Standardization

Each feature is shifted and rescaled so that its mean becomes 0 (by subtracting the feature's mean from every value) and its standard deviation becomes 1.

For each feature, first subtract the feature's mean from every value (so standardized values always have zero mean), then divide by the standard deviation so that the resulting distribution has unit variance.

Scikit-Learn provides the StandardScaler transformer for standardization.

3. Normalization vs. standardization:

Differences: normalization maps feature values onto a common scale in [0, 1]. Standardization works column by column on the feature matrix, converting each feature to a z-score relative to the overall sample distribution, so every sample point influences the result. Standardization does not bound values to a specific range and is much less affected by outliers.

Similarities: both remove errors caused by differing units; both are linear transformations that scale the vector X and then shift it.
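
A minimal sketch of the two scalers (assumption: applied to housing_num_copy from above, with missing values filled first so the example is self-contained; in the actual workflow the scaler sits inside the pipeline built below):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

num_data = housing_num_copy.fillna(housing_num_copy.median())  # fill NaNs for this demo

# Min-max scaling: each feature is mapped into feature_range (default (0, 1))
min_max_scaler = MinMaxScaler(feature_range=(0, 1))
housing_minmax = min_max_scaler.fit_transform(num_data)

# Standardization: each feature ends up with zero mean and unit variance
std_scaler = StandardScaler()
housing_std = std_scaler.fit_transform(num_data)

print(housing_minmax.min(axis=0), housing_minmax.max(axis=0))                # 0s and 1s
print(housing_std.mean(axis=0).round(2), housing_std.std(axis=0).round(2))   # 0s and 1s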

 

(5) Building a pipeline with sklearn's Pipeline

 

Build the lists of numerical feature names and of text/categorical feature names:

In [64]:
num_attribs=list(housing_num_copy)
cat_attribs=["ocean_proximity"]
# print(num_attribs)
# print(cat_attribs)
 
['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
['ocean_proximity']
 

1. Create the pipeline for the numerical features

In [114]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_type_pipeline=Pipeline([
    ('simputer',SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler',StandardScaler())
    
])
 

2. Combine the pipelines with sklearn's ColumnTransformer

For ColumnTransformer, see (https://www.codercto.com/a/31047.html)

In [115]:
from sklearn.compose import ColumnTransformer
full_pipeline = ColumnTransformer([
        ("num", num_type_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])
 

3. Run the full pipeline to apply all the transformations to the data

In [116]:
housing_prepared=full_pipeline.fit_transform(housing)
housing_prepared
Out[116]:
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
 

VI. Select and train a model

(1) Linear regression

In [77]:
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)
Out[77]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
 

Take a few training instances to check the model:

In [82]:
some_data=housing.iloc[:5]
some_labels=housing_labels.iloc[:5]
some_data_prepared=full_pipeline.transform(some_data)
print (lin_reg.predict(some_data_prepared))
print(list(some_labels))
 
[210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
[286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
 

Comparing the predictions above with the true values shows large errors.

Next, look at the RMSE of linear regression on the whole training set:

In [83]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
 
68628.19819848922
 

The RMSE between the predictions and the actual house values is 68628: the model is underfitting.

The main ways to fix underfitting are to select a more powerful model, to feed the training algorithm better features, or to reduce the constraints on the model.

(2) Decision tree regression

In [84]:
from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepared,housing_labels)
Out[84]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')
 

1. Evaluate the model on the training set

In [86]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)
 
0.0
 

This result does not mean the model is good; on the contrary, it indicates severe overfitting.

 

2. Use cross-validation for a better evaluation

For cross-validation, see (https://www.cnblogs.com/sddai/p/5696834.html)

Pay particular attention to the scoring parameter of cross_val_score; see (https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules)

Here we use K-fold cross-validation: it randomly splits the training set into ten distinct subsets called folds, then trains and evaluates the decision tree model 10 times, each time using a different fold for evaluation and the other 9 for training. The result is an array of 10 scores:

In [96]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
# print(scores)
tree_rmse_scores = np.sqrt(-scores)
In [90]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
In [93]:
display_scores(tree_rmse_scores)
 
Scores: [69074.07529867 67212.26643168 71226.93340782 69011.53460089
 70760.95156189 74783.59188961 69079.5355068  71798.81224067
 75546.22137756 69397.60073089]
Mean: 70789.15230464848
Standard deviation: 2520.949152366461
 

Run the same K-fold cross-validation for linear regression and compare the two:

In [94]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
 
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.674001798348
 

Cross-validation gives not only an estimate of the model's performance but also a measure of how precise that estimate is (its standard deviation).

The comparison shows that the decision tree model overfits badly; it performs even worse than linear regression.

 

(3) Random forest regression

In [97]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
Out[97]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False)
In [100]:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
Out[100]:
21933.31414779769
 

Evaluate the model with K-fold cross-validation:

In [98]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
 
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122
 

The RMSE on the training set (21933) is still much lower than on the validation sets (about 52583), which means the model is still overfitting the training set.

 

VII. Fine-tuning the model

(1) Grid search

Use Scikit-Learn's GridSearchCV to do this search. Its purpose is automatic hyperparameter tuning: feed it the parameters and it returns the best-scoring combination.

All you need to do is tell GridSearchCV which hyperparameters to try and which values to try; it then uses cross-validation to evaluate every possible combination of hyperparameter values.

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=None, iid='warn', refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')

Parameters (see https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html):

estimator: the estimator to use, with all parameters other than the ones being tuned already set. Every estimator needs a scoring parameter or a score method.

param_grid: the parameter values to search over, given as a dict or a list of dicts

cv: cross-validation parameter; None (the default) means 3-fold cross-validation in the sklearn version used here. You can also pass the number of folds or a generator yielding train/test splits.

For example, the following code searches for the best combination of RandomForestRegressor hyperparameter values:

In [101]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')

grid_search.fit(housing_prepared, housing_labels)
Out[101]:
GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'max_features': [2, 4, 6, 8], 'n_estimators': [3, 10, 30]}, {'n_estimators': [3, 10], 'max_features': [2, 3, 4], 'bootstrap': [False]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)
 

param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of the n_estimators and max_features values listed in the first dict (do not worry about what these hyperparameters mean; they are explained later with random forests), then to try the 2 × 3 = 6 combinations in the second dict, this time with the bootstrap hyperparameter set to False instead of True (its default).

In total, the grid search explores 12 + 6 = 18 combinations of RandomForestRegressor hyperparameters and trains each model five times (since we use 5-fold cross-validation). In other words, there are 18 × 5 = 90 training rounds in all! The search takes quite a while, and when it is done you get the best combination of parameters, as shown below:

In [102]:
grid_search.best_params_
Out[102]:
{'max_features': 8, 'n_estimators': 30}
 

Look at the best estimator:

In [104]:
grid_search.best_estimator_
Out[104]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)
 

Look at the evaluation score of every parameter combination:

In [106]:
cvres=grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
 
64835.28497462568 {'max_features': 2, 'n_estimators': 3}
55992.209032344894 {'max_features': 2, 'n_estimators': 10}
53196.79504781103 {'max_features': 2, 'n_estimators': 30}
60216.336137120685 {'max_features': 4, 'n_estimators': 3}
53353.71606249172 {'max_features': 4, 'n_estimators': 10}
50876.67389111256 {'max_features': 4, 'n_estimators': 30}
59545.52190035615 {'max_features': 6, 'n_estimators': 3}
52527.66685438906 {'max_features': 6, 'n_estimators': 10}
50139.12484396652 {'max_features': 6, 'n_estimators': 30}
59067.08851253479 {'max_features': 8, 'n_estimators': 3}
51813.87695997916 {'max_features': 8, 'n_estimators': 10}
49987.25641246688 {'max_features': 8, 'n_estimators': 30}
62541.338087303535 {'n_estimators': 3, 'max_features': 2, 'bootstrap': False}
54836.498222902934 {'n_estimators': 10, 'max_features': 2, 'bootstrap': False}
60487.55001142947 {'n_estimators': 3, 'max_features': 3, 'bootstrap': False}
53044.81804206819 {'n_estimators': 10, 'max_features': 3, 'bootstrap': False}
57875.95997175016 {'n_estimators': 3, 'max_features': 4, 'bootstrap': False}
51840.92942525009 {'n_estimators': 10, 'max_features': 4, 'bootstrap': False}
 

The best solution is obtained with max_features set to 8 and n_estimators set to 30. For this combination the RMSE is 49987, slightly better than the score obtained earlier with the default hyperparameter values (52583).

 

(2) Randomized search

When the hyperparameter search space is large, it is better to use RandomizedSearchCV. It is used much like GridSearchCV, but instead of trying every possible combination it evaluates a given number of random combinations, sampling a random value for each hyperparameter at every iteration. This has two main advantages:

1) If you let the randomized search run for, say, 1,000 iterations, it explores 1,000 different values for each hyperparameter (instead of just the few values per hyperparameter that a grid search tries).

2) You can control the computational budget of the hyperparameter search simply by setting the number of iterations.

In [107]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
Out[107]:
RandomizedSearchCV(cv=5, error_score='raise-deprecating',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=42, verbose=0, warm_start=False),
          fit_params=None, iid='warn', n_iter=10, n_jobs=None,
          param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fbd87196710>, 'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fbd871965f8>},
          pre_dispatch='2*n_jobs', random_state=42, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)
In [109]:
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
 
49150.657232934034 {'max_features': 7, 'n_estimators': 180}
51389.85295710133 {'max_features': 5, 'n_estimators': 15}
50796.12045980556 {'max_features': 3, 'n_estimators': 72}
50835.09932039744 {'max_features': 5, 'n_estimators': 21}
49280.90117886215 {'max_features': 7, 'n_estimators': 122}
50774.86679035961 {'max_features': 3, 'n_estimators': 75}
50682.75001237282 {'max_features': 3, 'n_estimators': 88}
49608.94061293652 {'max_features': 5, 'n_estimators': 100}
50473.57642831875 {'max_features': 3, 'n_estimators': 150}
64429.763804893395 {'max_features': 5, 'n_estimators': 2}
 

Look at the relative importance of each feature:

In [108]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Out[108]:
array([7.04224175e-02, 6.02124940e-02, 4.59296052e-02, 1.55428191e-02,
       1.50341798e-02, 1.60330990e-02, 1.48669649e-02, 3.80234348e-01,
       4.54001758e-02, 1.11548704e-01, 6.46238352e-02, 1.10748235e-02,
       1.42318203e-01, 3.89612797e-05, 1.76021449e-03, 4.95915437e-03])
 

Pair the importance scores with the attribute names:

In [117]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"] 
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Out[117]:
[(0.38023434841388537, 'median_income'),
 (0.14231820319691774, 'INLAND'),
 (0.11154870415426024, 'pop_per_hhold'),
 (0.07042241751016282, 'longitude'),
 (0.06462383516760929, 'bedrooms_per_room'),
 (0.060212493972978696, 'latitude'),
 (0.04592960524268814, 'housing_median_age'),
 (0.04540017578326126, 'rooms_per_hhold'),
 (0.0160330990414146, 'population'),
 (0.015542819135579315, 'total_rooms'),
 (0.015034179787027079, 'total_bedrooms'),
 (0.014866964945789509, 'households'),
 (0.011074823508549598, '<1H OCEAN'),
 (0.004959154365657991, 'NEAR OCEAN'),
 (0.0017602144944791275, 'NEAR BAY'),
 (3.896127973928007e-05, 'ISLAND')]
 

With this information you can drop some of the less useful features (for example, apparently only one ocean_proximity category, INLAND, is really useful, so the others could be dropped). A rough sketch of keeping only the top features follows.
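
A rough sketch (not in the original notebook) of keeping only the k most important features, using feature_importances_, attributes, and housing_prepared defined above; k = 8 is an arbitrary choice for illustration:

k = 8                                                     # arbitrary number of features to keep
top_k_idx = np.argsort(feature_importances)[-k:]          # indices of the k largest importances
print(sorted(attributes[i] for i in top_k_idx))           # names of the surviving features

housing_prepared_top_k = housing_prepared[:, top_k_idx]   # reduced training matrix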

 

VIII. Evaluate the model on the test set

In [118]:
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
In [119]:
final_rmse 
Out[119]:
47997.74457886712
 

To sum up: the steps above show what a machine-learning project looks like and how the workflow fits together. Data preparation is a crucial part of it, and the features we train on largely determine the final performance of the model. I hope to keep working through Hands-On Machine Learning with Scikit-Learn and TensorFlow to study machine learning and deep learning, checking the official documentation for every class and function used along the way. Keep it up!

posted on 2019-01-19 17:05 by Laurel1115