scikit-learn General Examples, Part 6: Imputing missing values before building an estimator

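This example shows that imputing missing values can give better results than simply discarding the samples that contain them. Imputation does not always improve prediction accuracy, so check it with cross-validation. Sometimes dropping the rows with missing values, or replacing the missing entries with a marker value, is more effective.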
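
The cross-validation check described above can be sketched as follows. This is a minimal illustration, not the script from this post: the synthetic make_regression data, the 50% missing rate, the -999.0 marker value and the small forest are all assumptions chosen to keep it quick to run. It uses SimpleImputer from sklearn.impute, which needs scikit-learn 0.20 or later.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)

# Synthetic regression data (an assumption; any numeric dataset would do)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

# Hide one randomly chosen feature in roughly half of the rows
has_missing = rng.rand(X.shape[0]) < 0.5
rows = np.where(has_missing)[0]
X_missing = X.copy()
X_missing[rows, rng.randint(0, X.shape[1], rows.size)] = np.nan


def forest():
    # Small forest so the comparison stays fast
    return RandomForestRegressor(n_estimators=20, random_state=0)


mean_pipe = make_pipeline(SimpleImputer(strategy="mean"), forest())
marker_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=-999.0), forest())

scores = {
    # Option 1: drop every row that has a missing entry
    "drop rows": cross_val_score(
        forest(), X_missing[~has_missing], y[~has_missing], cv=3).mean(),
    # Option 2: replace missing entries with the column mean
    "mean imputation": cross_val_score(
        mean_pipe, X_missing, y, cv=3).mean(),
    # Option 3: replace missing entries with an out-of-range marker
    "marker value": cross_val_score(
        marker_pipe, X_missing, y, cv=3).mean(),
}

for name, value in scores.items():
    print("%s: %.2f" % (name, value))
```

Which option wins depends on the data and the model, which is why the comparison is run rather than assumed.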

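Missing values can be replaced by the mean, the median, or the most frequent value, selected with the strategy hyperparameter. The median is a more robust estimator for data containing high-magnitude values that could dominate the result (a more common name for this is a "long tail").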
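
To make the difference between the strategies concrete, here is a small sketch on a made-up column with one long-tail value. The numbers are invented for illustration, and the example assumes SimpleImputer from sklearn.impute (scikit-learn 0.20 or later).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature column with a long tail: 1000.0 dominates the mean.
# The last entry is the missing one.
X = np.array([[1.0], [2.0], [2.0], [3.0], [1000.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Read back the value that replaced the missing entry in the last row
    filled[strategy] = imputer.fit_transform(X)[-1, 0]
    print(strategy, filled[strategy])
```

The mean strategy fills the gap with 201.6, pulled far from the typical values by the single outlier, while median and most_frequent both fill it with 2.0.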

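Script output:

Score with the entire dataset = 0.56
Score without the samples containing missing values = 0.48
Score after imputation of the missing values = 0.57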

In this case, imputing the missing values helps the regressor get closer to the original score.

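# coding:utf-8
import numpy as np

# Note: load_boston was removed in scikit-learn 1.2, so this script needs an
# earlier release.
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
# SimpleImputer (scikit-learn 0.20+) replaces sklearn.preprocessing.Imputer,
# which was removed in 0.22.
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

dataset = load_boston()
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_full, y_full).mean()
print("Score with the entire dataset = %.2f" % score)

# Add missing values in 75% of the samples
missing_rate = 0.75
# np.floor returns a float; cast to int so it can be used as an array size
n_missing_samples = int(np.floor(n_samples * missing_rate))
# Boolean mask of the samples that will get a missing value
# (the np.bool alias is gone from recent NumPy, so use the builtin bool)
missing_samples = np.hstack((np.zeros(n_samples - n_missing_samples,
                                      dtype=bool),
                             np.ones(n_missing_samples,
                                     dtype=bool)))
rng.shuffle(missing_samples)
# For each affected sample, pick one feature at random to blank out
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score without the samples containing missing values
X_filtered = X_full[~missing_samples, :]
y_filtered = y_full[~missing_samples]
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
score = cross_val_score(estimator, X_filtered, y_filtered).mean()
print("Score without the samples containing missing values = %.2f" % score)

# Estimate the score after imputation of the missing values
X_missing = X_full.copy()
# 0 is used as the missing-value placeholder
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
# SimpleImputer always imputes column-wise, so the old axis=0 argument is gone
estimator = Pipeline([("imputer", SimpleImputer(missing_values=0,
                                                strategy="mean")),
                      ("forest", RandomForestRegressor(random_state=0,
                                                       n_estimators=100))])
score = cross_val_score(estimator, X_missing, y_missing).mean()
print("Score after imputation of the missing values = %.2f" % score)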
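
One thing to keep in mind about the script above: it marks missing entries with 0, so any feature value that is genuinely 0 is imputed as well. The sketch below illustrates this on a tiny made-up array and shows np.nan as an alternative placeholder. The data is invented for illustration, and the example assumes SimpleImputer from sklearn.impute (scikit-learn 0.20 or later).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The second column is a 0/1 flag, so its zeros are real observations.
X = np.array([[1.0, 0.0],
              [3.0, 1.0],
              [5.0, 0.0]])

# Placeholder 0: hide X[1, 0] by writing 0 over it
X_zero = X.copy()
X_zero[1, 0] = 0.0
out_zero = SimpleImputer(missing_values=0,
                         strategy="mean").fit_transform(X_zero)

# Placeholder NaN: hide the same entry with np.nan instead
X_nan = X.copy()
X_nan[1, 0] = np.nan
out_nan = SimpleImputer(missing_values=np.nan,
                        strategy="mean").fit_transform(X_nan)

print(out_zero)  # the genuine zeros in the flag column were replaced too
print(out_nan)   # only the hidden entry was filled in
```

With 0 as the placeholder the flag column comes back as all ones, whereas with np.nan the original array is recovered exactly, because the hidden value 3.0 happens to equal the mean of the remaining entries in its column.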

posted @ 2016-10-04 21:48  Tacey Wong