[Machine Learning] California Housing Price Prediction

 

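This analysis is a bit messy and probably not a great example. I lost interest in following it by the end, but it still covers a fairly complete workflow.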

Housing

 

 

 

1. Import the data analysis packages and load the data

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
In [2]:
housing = pd.read_csv('housing.csv', sep=',')
housing.head()
Out[2]:
 
 longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
 

2. Data overview

2.1 Data types and null check

In [3]:
housing.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
 

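We can note the following:

The dataset has 20,640 samples in total, and only total_bedrooms contains null values (20,433 non-null entries, so 207 are missing).

Every column except the last one, ocean_proximity, is numeric. ocean_proximity has dtype object, which here means text, and we can use the value_counts() method to check whether it is a categorical attribute.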

In [4]:
housing['ocean_proximity'].value_counts()
Out[4]:
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
 

Sure enough, ocean_proximity is categorical, with five values: <1H OCEAN, INLAND, NEAR OCEAN, NEAR BAY, ISLAND

 

2.2 Basic statistics

In [5]:
housing.describe()
Out[5]:
 
 longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 20640.000000 20640.000000 20640.000000 20640.000000 20433.000000 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081 537.870553 1425.476744 499.539680 3.870671 206855.816909
std 2.003532 2.135952 12.585558 2181.615252 421.385070 1132.462122 382.329753 1.899822 115395.615874
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.800000 33.930000 18.000000 1447.750000 296.000000 787.000000 280.000000 2.563400 119600.000000
50% -118.490000 34.260000 29.000000 2127.000000 435.000000 1166.000000 409.000000 3.534800 179700.000000
75% -118.010000 37.710000 37.000000 3148.000000 647.000000 1725.000000 605.000000 4.743250 264725.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000
In [6]:
housing.hist(bins=50, figsize=(20,15))
plt.show()
 
 

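3. Split the dataset into a training set and a test set

When training a machine learning model, we need held-out data to evaluate it, so we split the dataset into a training set and a test set. A common choice is to set aside 20% of the data as the test set.

3.1 Randomly pick 20% of the dataset as the test set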

In [7]:
test = np.random.permutation(5)
print(test)
 
[0 4 1 3 2]
In [8]:
def split_train_set(data, test_rate=0.2):
    shuffled_indices = np.random.permutation(len(data))  
    test_size = int(len(data) * test_rate)
    test_indices = shuffled_indices[:test_size]
    train_indices = shuffled_indices[test_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_data, test_data = split_train_set(housing)
print("train set count: %d, test set count: %d" % (len(train_data), len(test_data)))
 
train set count: 16512, test set count: 4128
 

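This method is not perfect: every run produces a different training set and test set. We want the split to be identical across runs, so that results stay comparable when we tune an algorithm or change the training data. There are two ways to achieve this:
  Save the training and test sets to files the first time the method runs, then load them from the files on later runs.
  Call np.random.seed() before calling np.random.permutation().
Scikit-Learn provides several functions for splitting a dataset into subsets. The simplest is train_test_split, which does almost the same thing as the split_train_set function defined above, with a couple of extra features. First, it has a random_state parameter for setting the random generator seed, as just described. Second, you can pass it several datasets with the same number of rows, and it will split all of them on the same indices.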

In [9]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(housing, test_size=0.2, random_state=42)
print("train set count: %d, test set count: %d" % (len(train_data), len(test_data)))
 
train set count: 16512, test set count: 4128
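The second option above, seeding the generator before the permutation, can be sketched as follows. This is only an illustration: the function name split_indices_seeded and the seed value 42 are assumptions, and the function returns row indices instead of DataFrame slices to stay self-contained.

```python
import numpy as np

def split_indices_seeded(n_rows, test_rate=0.2, seed=42):
    """Return (train_indices, test_indices); the same seed gives the same split."""
    np.random.seed(seed)                       # fix the global generator state
    shuffled = np.random.permutation(n_rows)   # same order on every call
    test_size = int(n_rows * test_rate)
    return shuffled[test_size:], shuffled[:test_size]

train_a, test_a = split_indices_seeded(20640)
train_b, test_b = split_indices_seeded(20640)
same_split = np.array_equal(test_a, test_b)    # identical test indices across calls
print(len(train_a), len(test_a), same_split)
```

The indices can then be passed to data.iloc[...] exactly as in split_train_set.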
 

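3.2 Stratified sampling for the train/test split

Purely random sampling is fine when the data is evenly distributed. When it is not, we should consider stratified sampling. For example, if a country has five regions holding 5%, 15%, 30%, 20% and 30% of the population, we want the sample to follow the same proportions so that it represents the whole dataset better.
If you ask experts, they will tell you that median income is a very important attribute for predicting median house prices. You therefore want the test set to be representative of the various income categories in the whole dataset. Because median income is a continuous numerical attribute, you first need to create an income category attribute.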

In [10]:
housing.hist(column='median_income')
Out[10]:
[Figure: histogram of median_income]
 
In [11]:
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
# Cap the category at 5; assign the result back instead of relying on inplace=True
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)
In [12]:
housing.hist(column='income_cat')
Out[12]:
[Figure: histogram of income_cat]
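To make the income category rule concrete, here is a small worked example on five hand-picked income values (the values are illustrative, not rows from the dataset). Dividing by 1.5 and rounding up limits the number of categories, and everything above category 5 is merged into category 5.

```python
import numpy as np

# Illustrative median_income values (in tens of thousands of dollars)
median_income = np.array([0.4999, 1.5, 3.8707, 8.3252, 15.0001])

income_cat = np.ceil(median_income / 1.5)                # 1, 1, 3, 6, 11
income_cat = np.where(income_cat < 5, income_cat, 5.0)   # merge 5 and above into 5
print(income_cat)
```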
 
 

Use Scikit-Learn's StratifiedShuffleSplit for stratified sampling. Its signature is: class sklearn.model_selection.StratifiedShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)

In [13]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_data = housing.loc[train_index]
    strat_test_data = housing.loc[test_index]
    
# Check the result
strat_train_data.hist(column='income_cat')
strat_test_data.hist(column='income_cat')
Out[13]:
[Figures: histograms of income_cat for the training set and for the test set]
 
 
 

The training set and the test set have nearly the same income_cat distribution, and both match the distribution of the full dataset before the split.
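The same check can be done numerically instead of by eye. The sketch below uses made-up labels that follow the five-region example from earlier (shares of 5%, 15%, 30%, 20% and 30%) and compares the class shares in each split with the original ones. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical strata: 1000 samples spread over 5 regions
shares = [0.05, 0.15, 0.30, 0.20, 0.30]
labels = np.repeat(np.arange(5), [50, 150, 300, 200, 300])
X = np.zeros((len(labels), 1))   # features do not influence the split

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, labels))

# Share of each region inside each split
train_shares = np.bincount(labels[train_idx], minlength=5) / len(train_idx)
test_shares = np.bincount(labels[test_idx], minlength=5) / len(test_idx)
print(train_shares, test_shares)
```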

In [14]:
# Restore the data by dropping the income_cat column
for dataset in (strat_train_data, strat_test_data, housing):
    dataset.drop(['income_cat'], axis=1, inplace=True)
 

4. Explore the data in depth

 

4.1 Visualize the geographic data

In [15]:
housing.plot(kind='scatter',x='longitude', y='latitude')
Out[15]:
[Figure: scatter plot of latitude against longitude]
 
In [16]:
housing.plot(kind='scatter',x='longitude', y='latitude', alpha=0.1)
Out[16]:
[Figure: the same scatter plot with alpha=0.1, which makes the high-density areas stand out]
 
In [17]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
s=housing["population"]/100, label="population",
c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()
Out[17]:
[Figure: scatter plot with marker size proportional to population and color showing median_house_value]
 
 

4.2 Look at correlations between attributes

 

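Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes with the corr() method: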

In [18]:
corr_matrix = housing.select_dtypes(include=[np.number]).corr()  # numeric columns only; ocean_proximity is text
print(corr_matrix)
 
                    longitude  latitude  housing_median_age  total_rooms  \
longitude            1.000000 -0.924664           -0.108197     0.044568   
latitude            -0.924664  1.000000            0.011173    -0.036100   
housing_median_age  -0.108197  0.011173            1.000000    -0.361262   
total_rooms          0.044568 -0.036100           -0.361262     1.000000   
total_bedrooms       0.069608 -0.066983           -0.320451     0.930380   
population           0.099773 -0.108785           -0.296244     0.857126   
households           0.055310 -0.071035           -0.302916     0.918484   
median_income       -0.015176 -0.079809           -0.119034     0.198050   
median_house_value  -0.045967 -0.144160            0.105623     0.134153   

                    total_bedrooms  population  households  median_income  \
longitude                 0.069608    0.099773    0.055310      -0.015176   
latitude                 -0.066983   -0.108785   -0.071035      -0.079809   
housing_median_age       -0.320451   -0.296244   -0.302916      -0.119034   
total_rooms               0.930380    0.857126    0.918484       0.198050   
total_bedrooms            1.000000    0.877747    0.979728      -0.007723   
population                0.877747    1.000000    0.907222       0.004834   
households                0.979728    0.907222    1.000000       0.013033   
median_income            -0.007723    0.004834    0.013033       1.000000   
median_house_value        0.049686   -0.024650    0.065843       0.688075   

                    median_house_value  
longitude                    -0.045967  
latitude                     -0.144160  
housing_median_age            0.105623  
total_rooms                   0.134153  
total_bedrooms                0.049686  
population                   -0.024650  
households                    0.065843  
median_income                 0.688075  
median_house_value            1.000000  
 

Look at how each attribute correlates with the median house value

In [19]:
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[19]:
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
 

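The correlation coefficient ranges from -1 to 1. The closer it is to 1, the stronger the positive correlation; for example, the median house value tends to go up when the median income goes up. A coefficient close to -1 indicates a strong negative correlation; note the slight negative correlation between latitude and the median house value (that is, prices tend to go down as you go north). Finally, a coefficient close to 0 means there is no linear correlation between the two.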
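The point about linear correlation is worth a small numerical illustration. The sketch below computes Pearson's r by hand (covariance divided by the product of the standard deviations) on synthetic data. The helper name pearson_r is an assumption, not part of Pandas.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.linspace(-1.0, 1.0, 101)
r_linear = pearson_r(x, 2 * x + 1)    # perfectly linear, increasing
r_negative = pearson_r(x, -3 * x)     # perfectly linear, decreasing
r_quadratic = pearson_r(x, x ** 2)    # fully dependent on x, but not linearly
print(r_linear, r_negative, r_quadratic)
```

The quadratic case gives a coefficient of essentially zero even though y is completely determined by x, so a value near 0 rules out only a linear relationship.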

 

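We can also use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute. Here we focus only on the few attributes that seem most correlated with the median house value, since they are the most promising ones.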

In [23]:
from pandas.plotting import scatter_matrix
attrs =  ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attrs], figsize=(12, 8))
Out[23]:
[Figure: 4x4 scatter matrix of median_house_value, median_income, total_rooms and housing_median_age]
 
 

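The scatter matrix shows that median_income is the most strongly correlated attribute, so let's zoom in on it. The zoomed plot contains clear horizontal lines at certain values of median_house_value, such as 500000, 450000 and 350000. The line at 500000 comes from the price cap applied to the data (the maximum in describe() is 500001). These artifacts can hurt our predictions, so we may want to remove the corresponding districts.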

In [24]:
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
Out[24]:
[Figure: scatter plot of median_house_value against median_income, alpha=0.1]
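If we do decide to remove the capped districts, a boolean mask is enough. The sketch below runs on a tiny made-up sample; the cap value 500001.0 is taken from the maximum reported by describe(), and treating every row at that value as capped is an assumption. Whether dropping them is a good idea depends on whether the model must predict prices above the cap.

```python
import pandas as pd

CAP = 500001.0   # assumed cap: the maximum of median_house_value in describe()

# Made-up rows for illustration only
sample = pd.DataFrame({
    "median_income": [8.3, 2.1, 9.7, 3.5, 12.0],
    "median_house_value": [452600.0, 119600.0, 500001.0, 179700.0, 500001.0],
})

is_capped = sample["median_house_value"] >= CAP   # True for rows sitting at the cap
uncapped = sample[~is_capped]                     # keep only the rows below the cap
print(int(is_capped.sum()), len(uncapped))
```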
 
 

4.3 Explore hidden information through attribute combinations

 

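One last thing to do before preparing the data for machine learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don't know how many households there are; what you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very informative; you probably want to compare it with the total number of rooms. The population per household also looks like an interesting combination. Let's create these new attributes: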

In [25]:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"]=housing["population"] / housing["households"]
In [26]:
corr_matrix = housing.select_dtypes(include=[np.number]).corr()  # numeric columns only
corr_matrix["median_house_value"].sort_values(ascending=False)
Out[26]:
median_house_value          1.000000
median_income               0.688075
rooms_per_household         0.151948
total_rooms                 0.134153
housing_median_age          0.105623
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
latitude                   -0.144160
bedrooms_per_room          -0.255880
Name: median_house_value, dtype: float64
 

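5. Prepare the data for machine learning algorithms

It is finally time to prepare the data for your machine learning algorithms. You should write functions for this instead of doing it by hand, for several reasons:
  ·You can easily reproduce these transformations on any dataset (for example, after you get a fresh dataset).
  ·You gradually build a library of transformation functions that you can reuse in future projects.
  ·You can use these functions in a live system to transform new data before feeding it to the algorithm.
  ·You can easily try various transformations and see which combination works best.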

5.1 Prepare a clean training set and separate the features from the labels

In [28]:
housing = strat_train_data.drop("median_house_value", axis=1)
housing_labels = strat_train_data["median_house_value"].copy()
 

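5.2 Handle missing values

There are three options for handling missing values:
  Drop the corresponding records
  Drop the whole attribute
  Set the missing values to some value, usually 0, the mean, or the median
DataFrame's dropna(), drop() and fillna() methods make each of these easy. Here we use the third option.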

In [29]:
# housing.dropna(subset=["total_bedrooms"]) # option 1: drop the rows
# housing.drop("total_bedrooms", axis=1) # option 2: drop the attribute
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median) # option 3: returns a new Series; assign it back to keep the result
Out[29]:
17606     351.0
18632     108.0
14650     471.0
3230      371.0
3555     1525.0
          ...  
6563      236.0
12053     294.0
13908     872.0
11159     380.0
15775     682.0
Name: total_bedrooms, Length: 16512, dtype: float64
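One detail is easy to miss: none of the three calls changes the DataFrame in place, which is why the cell above only displays the filled Series. The sketch below shows all three options on a four-row made-up frame and assigns the result of option 3 back to the column.

```python
import numpy as np
import pandas as pd

# Made-up rows; one total_bedrooms value is missing
df = pd.DataFrame({"total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
                   "total_bedrooms": [129.0, np.nan, 190.0, 235.0]})

dropped_rows = df.dropna(subset=["total_bedrooms"])      # option 1: 3 rows remain
dropped_col = df.drop("total_bedrooms", axis=1)          # option 2: column removed
median = df["total_bedrooms"].median()                   # NaN is skipped by median()
filled = df["total_bedrooms"].fillna(median)             # option 3: new Series

still_missing = int(df["total_bedrooms"].isna().sum())   # df itself is unchanged so far
df["total_bedrooms"] = filled                            # assign back to keep the result
print(median, still_missing, int(df["total_bedrooms"].isna().sum()))
```

Remember to keep the computed median around, because the same value is needed later to fill missing values in the test set and in new data.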
 

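Scikit-Learn provides a handy class for handling missing values: SimpleImputer. To use it, first create a SimpleImputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute: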

In [34]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
# The median can only be computed on numerical attributes, so drop the text column first
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)    # computes each column's median and stores it in statistics_

print(imputer.statistics_)
 
[-118.51     34.26     29.     2119.5     433.     1164.      408.
    3.5409]
In [36]:
# Fill in the missing values
X = imputer.transform(housing_num)
# X is a NumPy array; convert it back to a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
housing_tr.info()
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16512 entries, 0 to 16511
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  float64
 3   total_rooms         16512 non-null  float64
 4   total_bedrooms      16512 non-null  float64
 5   population          16512 non-null  float64
 6   households          16512 non-null  float64
 7   median_income       16512 non-null  float64
dtypes: float64(8)
memory usage: 1.0 MB
 

As we can see, there are no null values left.
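The main benefit of the imputer over a manual fillna() is that the medians learned from the training set can be reused on any later data. A minimal sketch on made-up arrays (the names train and test are placeholders, not the housing splits):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: two numeric columns with a few missing entries
train = np.array([[1.0, 10.0], [2.0, np.nan], [np.nan, 30.0], [4.0, 50.0]])
test = np.array([[np.nan, np.nan], [3.0, 20.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                       # medians come from the training data only
test_filled = imputer.transform(test)    # the same medians fill the test data
print(imputer.statistics_, test_filled)
```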

 

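5.3 Handle text and categorical attributes

Earlier we found that ocean_proximity is a text attribute that holds category labels. Now we convert these labels to numbers.

5.3.1 Scikit-Learn provides the LabelEncoder transformer for this task: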

In [39]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing['ocean_proximity']
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded)
print(encoder.classes_)
 
[0 0 4 ... 1 0 3]
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
 

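This label encoder assigns integers without regard to how similar the categories are, so an algorithm may wrongly treat categories with nearby codes as similar. Scikit-Learn provides a OneHotEncoder that handles this case.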

5.3.2 OneHotEncoder

In [44]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
print(housing_cat_1hot)    # the result is a SciPy sparse matrix, which stores only the positions and values of the nonzero elements
 
  (0, 0)	1.0
  (1, 0)	1.0
  (2, 4)	1.0
  (3, 1)	1.0
  (4, 0)	1.0
  (5, 1)	1.0
  (6, 0)	1.0
  (7, 1)	1.0
  (8, 0)	1.0
  (9, 0)	1.0
  (10, 1)	1.0
  (11, 1)	1.0
  (12, 0)	1.0
  (13, 1)	1.0
  (14, 1)	1.0
  (15, 0)	1.0
  (16, 3)	1.0
  (17, 1)	1.0
  (18, 1)	1.0
  (19, 1)	1.0
  (20, 0)	1.0
  (21, 0)	1.0
  (22, 0)	1.0
  (23, 1)	1.0
  (24, 1)	1.0
  :	:
  (16487, 1)	1.0
  (16488, 1)	1.0
  (16489, 4)	1.0
  (16490, 3)	1.0
  (16491, 0)	1.0
  (16492, 3)	1.0
  (16493, 1)	1.0
  (16494, 1)	1.0
  (16495, 0)	1.0
  (16496, 1)	1.0
  (16497, 3)	1.0
  (16498, 1)	1.0
  (16499, 0)	1.0
  (16500, 0)	1.0
  (16501, 0)	1.0
  (16502, 4)	1.0
  (16503, 0)	1.0
  (16504, 1)	1.0
  (16505, 1)	1.0
  (16506, 0)	1.0
  (16507, 1)	1.0
  (16508, 1)	1.0
  (16509, 1)	1.0
  (16510, 0)	1.0
  (16511, 3)	1.0
In [45]:
# Call toarray() to convert it to a dense NumPy array
print(housing_cat_1hot.toarray())
 
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 ...
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]
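As far as I know, in scikit-learn 0.20 and later OneHotEncoder accepts string categories directly, so the intermediate LabelEncoder step is optional. The sketch below assumes such a version and uses a few made-up category values in a 2D array, since the encoder expects one column per feature.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, given as a column of strings
cats = np.array([["<1H OCEAN"], ["INLAND"], ["NEAR BAY"], ["INLAND"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(cats).toarray()   # sparse result converted to dense
print(encoder.categories_[0])
print(one_hot)
```

The column order of the result follows encoder.categories_, which lists the categories in sorted order.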
 

5.3.3 LabelBinarizer

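LabelBinarizer goes from text categories to one-hot vectors in a single step and returns a dense NumPy array directly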

In [46]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)
print(housing_cat_1hot)
 
[[1 0 0 0 0]
 [1 0 0 0 0]
 [0 0 0 0 1]
 ...
 [0 1 0 0 0]
 [1 0 0 0 0]
 [0 0 0 1 0]]
 

5.3.4 Custom transformers

In [47]:
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
In [48]:
print(housing_extra_attribs)
 
[[-121.89 37.29 38.0 ... '<1H OCEAN' 4.625368731563422 2.094395280235988]
 [-121.93 37.05 14.0 ... '<1H OCEAN' 6.008849557522124 2.7079646017699117]
 [-117.2 32.77 31.0 ... 'NEAR OCEAN' 4.225108225108225 2.0259740259740258]
 ...
 [-116.4 34.09 9.0 ... 'INLAND' 6.34640522875817 2.742483660130719]
 [-118.01 33.82 31.0 ... '<1H OCEAN' 5.50561797752809 3.808988764044944]
 [-122.45 37.77 52.0 ... 'NEAR BAY' 4.843505477308295 1.9859154929577465]]
 

5.4 Feature scaling

 

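Normalization (min-max scaling): Scikit-Learn provides a transformer called MinMaxScaler. If for some reason you don't want the range to be 0 to 1, you can change it with the feature_range hyperparameter. Standardization: Scikit-Learn provides a transformer called StandardScaler.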
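Both scalings are simple formulas, shown here with plain NumPy on a toy column. Min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std. The sketch uses the population standard deviation (ddof=0), which is what StandardScaler uses as well.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

min_max = (x - x.min()) / (x.max() - x.min())   # values end up in [0, 1]
standardized = (x - x.mean()) / x.std()         # zero mean, unit variance
print(min_max)
print(standardized)
```

Standardization does not bound the values to a fixed range, but it is much less affected by outliers than min-max scaling.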

 

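5.5 Transformation pipelines

Scikit-Learn provides the Pipeline class to chain such transformation steps in the right order. Here is a pipeline for the numerical attributes: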

In [50]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy="median")),
                         ('attribs_adder', CombinedAttributesAdder()),
                         ('std_scaler', StandardScaler()),])
housing_num_tr = num_pipeline.fit_transform(housing_num)
print(housing_num_tr)
 
[[-1.15604281  0.77194962  0.74333089 ... -0.31205452 -0.08649871
   0.15531753]
 [-1.17602483  0.6596948  -1.1653172  ...  0.21768338 -0.03353391
  -0.83628902]
 [ 1.18684903 -1.34218285  0.18664186 ... -0.46531516 -0.09240499
   0.4222004 ]
 ...
 [ 1.58648943 -0.72478134 -1.56295222 ...  0.3469342  -0.03055414
  -0.52177644]
 [ 0.78221312 -0.85106801  0.18664186 ...  0.02499488  0.06150916
  -0.30340741]
 [-1.43579109  0.99645926  1.85670895 ... -0.22852947 -0.09586294
   0.10180567]]
 

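Scikit-Learn also provides the FeatureUnion class, which combines several pipelines. Each sub-pipeline starts with a selector transformer: it simply picks the desired attributes (numerical or categorical), drops the rest, and converts the resulting DataFrame to a NumPy array. This walkthrough does not use a built-in transformer for selecting Pandas DataFrame columns, so we write a simple custom one for the task: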

In [67]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names=attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
    
class MyLabelBinarizer(TransformerMixin):
    def __init__(self, *args, **kwargs):
        self.encoder = LabelBinarizer(*args, **kwargs)
    def fit(self, x, y=0):
        self.encoder.fit(x)
        return self
    def transform(self, x, y=0):
        return self.encoder.transform(x)
    
from sklearn.pipeline import FeatureUnion

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
num_pipeline = Pipeline([('selector', DataFrameSelector(num_attribs)),
                         ('imputer', SimpleImputer(strategy="median")),
                         ('attribs_adder', CombinedAttributesAdder()),
                         ('std_scaler', StandardScaler()),])
cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),
                         ('label_binarizer', MyLabelBinarizer()),])
full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline),
                                               ("cat_pipeline", cat_pipeline),])
housing_prepared = full_pipeline.fit_transform(housing)
print(type(housing_prepared))
print(np.shape(housing_prepared))
 
<class 'numpy.ndarray'>
(16512, 16)
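For comparison, scikit-learn 0.20 and later also ship ColumnTransformer, which selects DataFrame columns by name and can replace the custom selector and the FeatureUnion. The sketch below is an alternative under that version assumption, not the approach used in this post; it runs on a made-up four-row frame and leaves out CombinedAttributesAdder to stay short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"],
})
num_cols = ["total_rooms", "total_bedrooms"]
cat_cols = ["ocean_proximity"]

num_steps = Pipeline([("imputer", SimpleImputer(strategy="median")),
                      ("std_scaler", StandardScaler())])
preprocess = ColumnTransformer(
    [("num", num_steps, num_cols),          # numeric columns: impute, then scale
     ("cat", OneHotEncoder(), cat_cols)],   # text column: one-hot encode
    sparse_threshold=0,                     # always return a dense array
)
prepared = preprocess.fit_transform(df)
print(prepared.shape)
```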
 

6. Train models

 

6.1 LinearRegression

In [60]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Out[60]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [74]:
# Use the model to make a few predictions
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print('Predictions: ', lin_reg.predict(some_data_prepared))
print('Actual Labels: ', list(some_labels))
 
Predictions:  [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
Actual Labels:  [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
 

We can use Scikit-Learn's mean_squared_error function to measure the regression model's RMSE (root mean squared error) on the whole training set:

In [75]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)
 
68628.19819848922
 

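This means a typical prediction error of about $68,628, which is not great given that half of the districts have a median_house_value between 119,600 and 264,725.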
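To see what the metric does, the RMSE can be computed by hand: take the prediction errors, square them, average, and take the square root. The sketch below applies this to the five predictions and labels printed in section 6.1, so its result describes only those five districts, not the whole training set.

```python
import numpy as np

# The five predictions and actual labels printed in section 6.1
predictions = np.array([210644.60459286, 317768.80697211, 210956.43331178,
                        59218.98886849, 189747.55849879])
labels = np.array([286600.0, 340600.0, 196900.0, 46300.0, 254500.0])

errors = predictions - labels
mse = np.mean(errors ** 2)   # mean squared error
rmse = np.sqrt(mse)          # root mean squared error, in dollars
print(round(rmse))
```

Because the errors are squared before averaging, the two large misses (the first and last districts) dominate the result.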

6.2 DecisionTreeRegressor

In [76]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
Out[76]:
DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
posted @ 2020-04-21 14:40  早起的虫儿去吃鸟