Notes : <Hands-on ML with Sklearn & TF> Chapter 2
Chapter 2 - Housing
Main Steps
1. Look at the big picture
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare the data for Machine Learning algorithms
5. Select a model and train it
6. Fine-tune the model
7. Present the solution
8. Launch, monitor, and maintain the system
Frame the Problem and Look at the Big Picture
- Define the objective in business terms.
- How will your solution be used?
- What are the current solutions/workarounds (if any)?
- How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
- How should performance be measured?
- Is the performance measure aligned with the business objective?
- What would be the minimum performance needed to reach the business objective?
- What are comparable problems? Can you reuse experience or tools?
- Is human expertise available?
- How would you solve the problem manually?
- List the assumptions you (or others) have made so far.
- Verify assumptions if possible.
- Objective: revenue. The model's output (a district's predicted median housing price) is fed into a separate downstream ML system that decides whether the area is worth investing in; this model is one component of a pipeline.
- Current solution: experts gather and keep district information up to date and estimate values with a complex set of rules.
- Problem framing: supervised learning (label = median house value), multivariate regression, batch learning (if the dataset were very large, split it and use MapReduce).
- Performance measure: RMSE, MAE (the l2 and l1 norms respectively), or other norms (see the formulas below).
- Comparable problems: Chapter 1, Example 1-1.
- Assumption: the downstream system needs the actual prices.
- Verify assumptions: confirmed that actual prices are needed, not price categories.
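For reference, the two error measures above, written in the book's notation (m instances, feature vectors x^(i), labels y^(i), hypothesis h):

RMSE(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^{2} }

MAE(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|

RMSE corresponds to the l2 norm of the error vector and MAE to the l1 norm; the higher the norm index, the more weight large errors get, which is why RMSE is more sensitive to outliers than MAE.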
Get the Data
Note: automate as much as possible so you can easily get fresh data.
- List the data you need and how much you need.
- Find and document where you can get that data.
- Check how much space it will take.
- Check legal obligations, and get authorization if necessary.
- Get access authorizations.
- Create a workspace (with enough storage space).
- Get the data.
- Convert the data to a format you can easily manipulate (without changing the data itself).
- Ensure sensitive information is deleted or protected (e.g., anonymized).
- Check the size and type of data (time series, sample, geographical, etc.).
- Sample a test set, put it aside, and never look at it (no data snooping!).
# Fetch the data
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)  # download the archive
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()
# Load the data and take a first look at its structure
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()
housing.info()
housing["ocean_proximity"].value_counts()
housing.describe()
# Jupyter magic command: embed the generated figures in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
Notice:
- Some attributes have been scaled and capped; discuss with the team whether the capped values are a problem for the task.
- Several attributes have tail-heavy distributions; consider transforming them (e.g., with a log) to get more bell-shaped distributions.
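As an illustration only (the choice of attribute and the use of np.log1p are assumptions, not a step from the book's notebook), a tail-heavy attribute can be compared before and after a log transform:
import numpy as np   # numpy is imported again a few cells below in these notes

# "population" has a long right tail; a log transform makes its distribution more bell-shaped
housing["population"].hist(bins=50)
plt.show()
np.log1p(housing["population"]).hist(bins=50)
plt.show()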
Create a test set and set it aside, to avoid data snooping bias.
import numpy as np
import numpy.random as rnd

rnd.seed(42)  # make this notebook's output identical at every run: the same seed yields the same pseudo-random sequence
# without a fixed seed, a different permutation is generated on every run

def split_train_test(data, test_ratio):
    shuffled_indices = rnd.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), len(test_set))
Seeding the RNG makes the pseudo-random sequence identical on every run, but this solution breaks the next time you fetch an updated dataset. A more robust approach is to use each instance's identifier to decide whether it should go into the test set. The housing data has no identifier column, so you can either use housing.reset_index() to add an index column, or build a unique identifier from stable features.
import hashlib

def test_set_check(identifier, test_ratio, hash):
    # use the last byte of the hash as an integer: values below 256 * test_ratio go into the test set
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
    return data.loc[~in_test_set], data.loc[in_test_set]  # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html#pandas.DataFrame.loc
# Option 1: use the row index as the identifier and hash it
housing_with_id_1 = housing.reset_index() #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
train_set_1, test_set_1 = split_train_test_by_id(housing_with_id_1, 0.2, "index")
print(len(train_set_1),len(test_set_1))
# Option 2: build an id from longitude and latitude and hash it
housing_with_id = housing.copy()
housing_with_id["id"] = housing["longitude"]*1000 + housing["latitude"]
train_set_2, test_set_2 = split_train_test_by_id(housing_with_id, 0.2, "id")
print(len(train_set_2),len(test_set_2))
# Option 3: simply use Scikit-Learn's train_test_split
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set),len(test_set))
# Stratified sampling: the test set should be representative of the whole dataset, and each stratum needs a sufficient number of instances
housing["income_cat"] = np.ceil(housing["median_income"]/1.5)
housing["income_cat"].where(housing["income_cat"]<5, 5.0, inplace=True)  # keep categories below 5, merge everything else into category 5.0; http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html#pandas.DataFrame.where
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) #http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
housing["income_cat"].value_counts()/len(housing)
strat_train_set["income_cat"].value_counts()/len(strat_train_set)
for set_ in (strat_test_set, strat_train_set):
    set_.drop(["income_cat"], axis=1, inplace=True)  # remove income_cat so the data is back to its original attributes
Explore the Data
Note: try to get insights from a field expert for these steps.
- Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
- Create a Jupyter notebook to keep a record of your data exploration.
- Study each attribute and its characteristics:
  - Name
  - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
  - % of missing values (see the short sketch after this list)
  - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
  - Possibly useful for the task?
  - Type of distribution (Gaussian, uniform, logarithmic, etc.)
- For supervised learning tasks, identify the target attribute(s).
- Visualize the data.
- Study the correlations between attributes.
- Study how you would solve the problem manually.
- Identify the promising transformations you may want to apply.
- Identify extra data that would be useful (go back to “Get the Data”).
- Document what you have learned.
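A minimal sketch for the per-attribute checklist above, using plain pandas calls (an assumption, not a step from the book's notebook):
# percentage of missing values per attribute
(housing.isnull().mean() * 100).sort_values(ascending=False)
# attribute types
housing.dtypes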
Visualizing Geographical Data
# create a copy of the training set for exploration
housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=.1)
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=housing["population"]/100, label="population", c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()
Looking for Correlations (compute the standard correlation coefficient between every pair of attributes)
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
# another way to check for correlations between attributes is to use pandas' scatter_matrix
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes],figsize=(24,16))
#zoom in
housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
Experimenting with Attribute Combinations (check how the correlations change compared with the original attributes)
#try out various attribute combinations
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]
#correlation matrix
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Prepare the Data
Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
- So you can easily prepare the data the next time you get a fresh dataset
- So you can apply these transformations in future projects
- To clean and prepare the test set
- To clean and prepare new data instances once your solution is live
- To make it easy to treat your preparation choices as hyperparameters
- Data cleaning:
- Fix or remove outliers (optional).
- Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
- Feature selection (optional):
- Drop the attributes that provide no useful information for the task.
- Feature engineering, where appropriate:
- Discretize continuous features.
- Decompose features (e.g., categorical, date/time, etc.).
- Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
- Aggregate features into promising new features.
- Feature scaling: standardize or normalize features.
Requirements
- reproduce these transformations easily on any dataset
- gradually build a library of transformation functions that can be reused
- use the functions in the live system to transform new data before feeding it to the model
- easily try various transformations and see which combination of transformations works best
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
# housing.dropna(subset=["total_bedrooms"])    # option 1: drop the districts with missing values
# housing.drop("total_bedrooms", axis=1)       # option 2: drop the whole attribute
# median = housing["total_bedrooms"].median()  # option 3: fill the missing values with some value (zero, the mean, the median...)
# housing["total_bedrooms"].fillna(median)     # option 3 (continued)
# option 3 with Scikit-Learn:
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)
print(housing_num.median().values)
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns = housing_num.columns)
Scikit-Learn's API Design
- Estimator : any object that can estimate some parameters based on a dataset
  fit()
  - the estimation itself is performed by the fit() method, which takes one dataset as a parameter (or two for supervised learning)
  - any other parameter is a hyperparameter, set as an instance variable
- Transformer :
  transform()
  - transforms a dataset and returns the transformed dataset
  - generally relies on the learned parameters
- Predictor :
  predict()
  - takes a dataset of new instances and returns a dataset of corresponding predictions (labels in supervised learning)
  - also has a score() method that measures the quality of the predictions on a given test set
- Inspection
  - an estimator's hyperparameters are accessible directly via public instance variables
  - an estimator's learned parameters are accessible via public instance variables with an underscore suffix
- Nonproliferation of classes
  - datasets are NumPy arrays or SciPy sparse matrices
  - hyperparameters are plain Python strings or numbers
- Composition
  - existing building blocks are reused as much as possible
- Sensible defaults
  - reasonable default values are provided for most parameters, making it easy to create a baseline working system
imputer.strategy
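The conventions above can be illustrated with a small custom transformer. This is a hypothetical example (MeanCenterer is not from the book): the hyperparameter is a plain constructor argument stored as a public instance variable, and the learned parameter carries a trailing underscore.
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):   # hypothetical example, not from the book
    def __init__(self, with_centering=True):           # hyperparameter, stored as a public instance variable
        self.with_centering = with_centering
    def fit(self, X, y=None):
        self.mean_ = X.mean(axis=0)                     # learned parameter: note the underscore suffix
        return self                                     # fit() returns self
    def transform(self, X):
        return X - self.mean_ if self.with_centering else X

centerer = MeanCenterer()
centered = centerer.fit_transform(housing_tr.values)    # fit_transform() is provided by TransformerMixin
centerer.mean_                                          # inspect the learned parameters
centerer.get_params()                                   # inspect the hyperparameters (from BaseEstimator)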
Handling Text and Categorical Attributes
# convert these text labels to numbers
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
print(encoder.classes_)
# use OneHotEncoder to convert the integer categorical values into one-hot vectors
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# fit_transform expects a 2D array, so reshape the 1D array of encoded categories
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
housing_cat_1hot
housing_cat_1hot.toarray()
# LabelBinarizer performs both steps (text to integer, then integer to one-hot) in one shot
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)  # pass sparse_output=True to the constructor to get a SciPy sparse matrix instead of a dense array
housing_cat_1hot
# custom transformer
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
housing.values
housing_extra_attribs = pd.DataFrame(housing_extra_attribs, columns=list(housing.columns)+["rooms_per_household", "population_per_household"])
housing_extra_attribs.head()
Feature Scaling :
- min-max scaling : MinMaxScaler : rescales each attribute to a fixed range (0-1 by default)
- standardization : StandardScaler : subtracts the mean, then divides by the standard deviation, giving zero mean and unit variance; values are not bounded to a fixed range, but standardization is much less affected by outliers
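A quick standalone comparison of the two scalers (an illustration, not a step from the book's notebook; the notes below apply StandardScaler inside the pipeline), run on the imputed numerical data housing_tr built above:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# housing_tr was imputed above, so neither scaler sees missing values
minmax_scaled = MinMaxScaler().fit_transform(housing_tr)   # (x - min) / (max - min): values fall in [0, 1] by default
std_scaled = StandardScaler().fit_transform(housing_tr)    # (x - mean) / standard deviation: zero mean, unit variance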
Transformation Pipelines : apply a sequence of transformations in order
# use a Pipeline to run fit_transform() on a sequence of estimators
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([('imputer', Imputer(strategy='median')),
                         ('attribs_adder', CombinedAttributesAdder()),
                         ('std_scaler', StandardScaler()),])
# three steps, each passing its output to the next: impute missing values -> add combined attributes -> standardize
# housing_num holds the numerical attributes only
housing_num_tr = num_pipeline.fit_transform(housing_num)
housing_num_tr[0:5]
# join the transformed numerical columns with the categorical (text) column
from sklearn.pipeline import FeatureUnion

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

# apply the LabelBinarizer on the categorical values
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([('selector', DataFrameSelector(num_attribs)),
                         ('imputer', Imputer(strategy='median')),
                         ('attribs_adder', CombinedAttributesAdder()),
                         ('std_scaler', StandardScaler())])
cat_pipeline = Pipeline([('selector', DataFrameSelector(cat_attribs)),
                         ('label_binarizer', LabelBinarizer()),])

full_pipeline = FeatureUnion(transformer_list=[("num_pipeline", num_pipeline), ("cat_pipeline", cat_pipeline)])
# preparation_pipeline is the same FeatureUnion as full_pipeline; the alias is kept because the exercise code below refers to it
preparation_pipeline = full_pipeline
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
housing_prepared.shape
Select and Train a Model
Notes: If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests). Once again, try to automate these steps as much as possible.
- Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
- Measure and compare their performance.
- For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
- Analyze the most significant variables for each algorithm.
- Analyze the types of errors the models make.
- What data would a human have used to avoid these errors?
- Have a quick round of feature selection and engineering.
- Have one or two more quick iterations of the five previous steps. Short-list the top three to five most promising models, preferring models that make different types of errors.
# Linear Regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
#prediction
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Prediation:\t", lin_reg.predict(some_data_prepared))
print("Labels:\t\t", list(some_labels))
#RMSE
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
# Decision Tree
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse  # a perfect score on the training set is a strong sign that the model has badly overfit the training data
# cross validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores:\t", scores)
    print("Mean:\t", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)
scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)
display_scores(lin_rmse_scores)
The Decision Tree is overfitting, while Linear Regression is underfitting the training data.
# Random Forests work by training many Decision Trees on random subsets of the features, then averaging their predictions
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
from sklearn.svm import SVR
svm_reg = SVR(kernel="linear")
svm_reg.fit(housing_prepared, housing_labels)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_labels, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
Fine-tune the System
Notes: You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning. As always automate what you can.
- Fine-tune the hyperparameters using cross-validation.
- Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?).
- Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).
- Try Ensemble methods. Combining your best models will often perform better than running them individually (a minimal sketch follows this list).
Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.
WARNING: Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.
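A minimal sketch of the ensemble idea, using only models already fit above (simple prediction averaging; this is an illustration, not the book's method, and it is evaluated on the training set only for brevity):
# illustration only: average the predictions of two models already trained above
# (use cross-validation in practice rather than the training error)
forest_pred = forest_reg.predict(housing_prepared)
lin_pred = lin_reg.predict(housing_prepared)
ensemble_pred = (forest_pred + lin_pred) / 2
ensemble_rmse = np.sqrt(mean_squared_error(housing_labels, ensemble_pred))
ensemble_rmse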
GridSearchCV: tell it which hyperparameters you want to experiment with and what values to try out, and it will evaluate all possible combinations of hyperparameter values using cross-validation.
# Grid search
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
grid_search.best_estimator_
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
pd.DataFrame(grid_search.cv_results_)
# Randomized search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {
'n_estimators': randint(low=1, high=200),
'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
n_iter=10, cv=5, scoring='neg_mean_squared_error')
rnd_search.fit(housing_prepared, housing_labels)
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
Analyze the Best Models and Their Errors
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
Evaluate Your System on the Test Set
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_transformed = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_transformed)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
Present Your Solution
- Document what you have done.
- Create a nice presentation.
- Make sure you highlight the big picture first.
- Explain why your solution achieves the business objective.
- Don’t forget to present interesting points you noticed along the way.
- Describe what worked and what did not.
- List your assumptions and your system’s limitations.
- Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., “the median income is the number-one predictor of housing prices”).
Launch!
- Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
- Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops (a minimal sketch follows this list).
- Beware of slow degradation too: models tend to “rot” as data evolves.
- Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
- Also monitor your inputs’ quality (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale). This is particularly important for online learning systems.
- Retrain your models on a regular basis on fresh data (automate as much as possible).
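A minimal sketch of such a monitoring check (hypothetical helper: the function name, the threshold, and the source of the freshly labeled data are assumptions, not part of the book):
# hypothetical monitoring helper: recompute RMSE on freshly labeled data and alert when it degrades
def check_live_rmse(model, X_recent, y_recent, rmse_threshold):
    predictions = model.predict(X_recent)
    rmse = np.sqrt(mean_squared_error(y_recent, predictions))
    if rmse > rmse_threshold:
        print("ALERT: live RMSE {:.0f} exceeds threshold {:.0f}".format(rmse, rmse_threshold))
    return rmse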
Exercises
- Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best SVR predictor perform?
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
param_grid = [{"kernel" : ["linear"], "C" : [10., 50.]},
{"kernel" : ['rbf'], "C" : [300., 600.], 'gamma' : [.001]}]
svr_reg = SVR()
svr_search = GridSearchCV(svr_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=4, verbose=2)
svr_search.fit(housing_prepared, housing_labels)
svres = svr_search.cv_results_
for mean_score, params in zip(svres["mean_test_score"], svres["params"]):
print(np.sqrt(-mean_score), params)
- Try replacing GridSearchCV with RandomizedSearchCV.
svr_reg.get_params()
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import expon, reciprocal
# see https://docs.scipy.org/doc/scipy-0.19.0/reference/stats.html
# for `expon()` and `reciprocal()` documentation and more probability distribution functions.
# Note: gamma is ignored when kernel is "linear"
param_distribs = {
'kernel': ['linear', 'rbf'],
'C': reciprocal(20, 200), #handson-ml answers 20000
'gamma': expon(scale=1.0),
}
svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
n_iter=10, cv=5, scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
rnd_search.fit(housing_prepared, housing_labels)
negative_mse = rnd_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
rnd_search.best_params_
expon_distrib = expon(scale=1.)
samples = expon_distrib.rvs(10000)
plt.figure(figsize=(10, 4))
plt.subplot(121)
plt.title("Exponential distribution (scale=1.0)")
plt.hist(samples, bins=50)
plt.subplot(122)
plt.title("Log of this distribution")
plt.hist(np.log(samples), bins=50)
plt.show()
- Try adding a transformer in the preparation pipeline to select only the most important attributes.
The feature selector below assumes you have already computed the feature importances (e.g., with the grid search's best Random Forest above).
from sklearn.base import BaseEstimator, TransformerMixin
def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X, y=None):
        return X[:, self.feature_indices_]
# choose k, the number of top features to keep
k = 5
# look at the selected features
top_k_feature_indices = indices_of_top_k(feature_importances, k)
print(top_k_feature_indices)
print(np.array(attributes)[top_k_feature_indices])
sorted(zip(feature_importances, attributes), reverse=True)[:k]
#pipeline
preparation_and_feature_selection_pipeline = Pipeline([
('preparation', full_pipeline),
('feature_selection', TopFeatureSelector(feature_importances, k))
])
#fit_transform
housing_prepared_top_k_features = preparation_and_feature_selection_pipeline.fit_transform(housing)
housing_prepared_top_k_features
- Try creating a single pipeline that does the full data preparation plus the final prediction.
Note: be sure to replace LabelBinarizer with a supervision-friendly version, otherwise fitting the full pipeline with labels will fail!
class SupervisionFriendlyLabelBinarizer(LabelBinarizer):
    def fit_transform(self, X, y=None):
        return super(SupervisionFriendlyLabelBinarizer, self).fit_transform(X)
# Replace the LabelBinarizer with a SupervisionFriendlyLabelBinarizer
cat_pipeline.steps[1] = ("label_binarizer", SupervisionFriendlyLabelBinarizer())
# Now you can create a full pipeline with a supervised predictor at the end.
full_prediction_pipeline = Pipeline([
    ("preparation", preparation_pipeline),
    ("linear", LinearRegression())
])
full_prediction_pipeline.fit(housing, housing_labels)
full_prediction_pipeline.predict(some_data)
prepare_select_and_predict_pipeline = Pipeline([
    ('preparation', preparation_pipeline),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('svr_reg', SVR(C=122659.12862707644, gamma=0.22653313890837068, kernel='rbf')),
])
prepare_select_and_predict_pipeline.fit(housing, housing_labels)
Finally found the cause of the earlier error: the LabelBinarizer had not been replaced with the supervision-friendly label binarizer!
- Automatically explore some preparation options using GridSearchCV.
param_grid = [
{'preparation__num_pipeline__imputer__strategy': ['mean', 'median', 'most_frequent'],
'feature_selection__k': [3, 4, 5, 6, 7]}
]
grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,
scoring='neg_mean_squared_error', verbose=2, n_jobs=4)
grid_search_prep.fit(housing, housing_labels)
grid_search_prep.best_params_
housing.shape