Data Representation and Feature Engineering -- Interaction Features and Polynomial Features
⭐ One way to enrich a feature representation, especially for linear models, is to add interaction features and polynomial features derived from the original data.
To add slopes to the linear model on the binned data (from Section 4.2), there are two options: 1. add back the original feature (the x-axis in the figure); 2. add interaction or product features.
1. Adding back the original feature
# Add the original feature back to the binned data
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

X, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
# Bin the data into 10 bins
bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins)
# One-hot-encode the bin membership
# (the parameter is sparse_output=False on scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse=False).fit(which_bin)
X_binned = encoder.transform(which_bin)
line_binned = encoder.transform(np.digitize(line, bins))
# Stack the 10 bin indicators and the raw feature: 11 columns in total
X_combined = np.hstack([X_binned, X])
line_combined = np.hstack([line_binned, line])
X_combined.shape
'''
(100, 11)
'''
reg = LinearRegression().fit(X_combined, y)
plt.plot(line, reg.predict(line_combined), label='LinearRegression')
plt.plot(X, y, 'o', c='k')
plt.vlines(bins, -3, 3, alpha=.2)
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
X_combined
'''
array([[ 0. , 0. , 0. , ..., 0. ,
0. , -0.75275929],
[ 0. , 0. , 0. , ..., 0. ,
1. , 2.70428584],
[ 0. , 0. , 0. , ..., 0. ,
0. , 1.39196365],
...,
[ 0. , 0. , 0. , ..., 0. ,
0. , -0.43475389],
[ 1. , 0. , 0. , ..., 0. ,
0. , -2.84748524],
[ 0. , 1. , 0. , ..., 0. ,
0. , -2.35265144]])
'''
📣
In this example, the model learned an offset for each bin together with a slope, but the slope turned out to be identical in every bin (see the quick check below)
- this is because the raw x column enters the model only once, so it contributes a single slope coefficient that is shared across all bins
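A quick way to verify this (my own check, not from the book) is to inspect the coefficients of the reg model just fitted; the raw x is the last column of X_combined, so it gets exactly one coefficient:
# The design matrix has 11 columns: 10 bin indicators plus the raw x feature.
# The single coefficient on the raw x column is the slope shared by all bins.
print('shared slope:', reg.coef_[-1])
# Each bin's offset is its indicator coefficient plus the intercept.
print('per-bin offsets:', reg.coef_[:-1] + reg.intercept_)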
⭐ What we would rather have is a different slope in each bin, which we can get by adding interaction or product features
2. Adding interaction or product features
2.1 Adding interaction features
# Add products of the raw feature and the bin indicators
X_product = np.hstack([X_binned, X * X_binned])
X_product.shape
# Output: (100, 20)
# The dataset now has 20 features: the indicator of the bin each data point
# falls into, plus the products of the raw feature and the bin indicators.
# Each product feature can be seen as a separate copy of the x-axis feature
# that is active only inside its bin.
line_product = np.hstack([line_binned, line * line_binned])
reg = LinearRegression().fit(X_product, y)
plt.plot(line, reg.predict(line_product), label='LinearRegression')
plt.plot(X, y, 'o', c='k')
plt.vlines(bins, -3, 3, alpha=.2)
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
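As a complementary check (again not from the book), the last 10 columns of X_product are x multiplied by each bin indicator, so their coefficients are the slopes learned for each bin, and they now differ from bin to bin:
# Columns 10-19 are x * (bin indicator); their coefficients are the
# per-bin slopes, one for each of the 10 bins.
print('per-bin slopes:', reg.coef_[10:])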
2.2 Using polynomials of the original features
⭐ Binning is one way to expand a continuous feature; another is to use polynomials of the original features
This is implemented in sklearn.preprocessing.PolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures
# Include polynomials up to x**10; the default include_bias=True would
# add a constant feature of ones, so we turn it off
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
X_poly.shape
# (100, 10)
print(poly.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0
'''
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']
'''
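As a quick sanity check (not from the book), each row of X_poly simply holds the successive powers of the corresponding entry of X:
# Each row of X_poly contains x, x**2, ..., x**10 for one sample
print('X[0]:', X[0])
print('X_poly[0]:', X_poly[0])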
# Using polynomial features together with a linear regression model gives the classical model of polynomial regression (a smooth curve)
reg = LinearRegression().fit(X_poly, y)
line_poly = poly.transform(line)
plt.plot(line, reg.predict(line_poly), label='poly regression')
plt.plot(X, y, 'o', c='k')
plt.vlines(bins, -3, 3, alpha=.2)
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
📣
Polynomial features give a very smooth fit on this one-dimensional data, but high-degree polynomials can behave in extreme ways on the boundaries or in regions with little data (see the check below)
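One way to see this (a quick check, assuming the reg and poly fitted above) is to evaluate the degree-10 fit just outside the training range, where there is no data to constrain it:
# Points just outside the [-3, 3) range used for training
x_outside = np.array([[-3.5], [3.5]])
# These predictions will typically lie far outside the range of y
print(reg.predict(poly.transform(x_outside)))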
⭐ For comparison, here is a kernel SVM learned on the original data, without any transformation
from sklearn.svm import SVR

for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
plt.plot(X, y, 'o', c='k')
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
📣
Note that with a more complex model such as the kernel SVM, we get a prediction similar to the polynomial regression without any explicit transformation of the features
3. Application to the Boston Housing dataset
⭐ Apply polynomial and interaction features to the Boston Housing dataset
3.1 Preparing the dataset
# First load the dataset and rescale it
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
# Rescale each feature to [0, 1]
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Extract polynomial and interaction features up to degree 2
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
# Inspect X_train_poly
print('original shape:{}'.format(X_train.shape))
print('after poly shape:{}'.format(X_train_poly.shape))
print('X_train_poly feature_names:{}'.format(poly.get_feature_names()))
'''
original shape:(379, 13)
after poly shape:(379, 105)
X_train_poly feature_names:['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
'''
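Note that load_boston was removed in scikit-learn 1.2. On newer versions, a common workaround (a sketch, assuming the CMU StatLib mirror is still reachable) is to rebuild the same arrays from the original source:
import pandas as pd
# The raw file spreads each record over two lines
data_url = 'http://lib.stat.cmu.edu/datasets/boston'
raw_df = pd.read_csv(data_url, sep=r'\s+', skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]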
📣
The original data has only 13 features, which PolynomialFeatures expanded to 105
- the new features include the constant feature '1', the original features, their squares, and all pairwise interactions between two different original features
- degree=2 means we take all features that are products of up to two original features (the short calculation below confirms the count)
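The 105 can be verified with a little combinatorics (a quick calculation, not from the book):
from math import comb
# 1 bias + 13 linear + 13 squares + C(13, 2) = 78 pairwise products
print(1 + 13 + 13 + comb(13, 2))
# Output: 105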
3.2 Building a Ridge regression model
Use Ridge regression to compare performance with and without the polynomial features
from sklearn.linear_model import Ridge
ridge1 = Ridge().fit(X_train_scaled, y_train)
ridge2 = Ridge().fit(X_train_poly, y_train)
print('score without interactions:{:.2}'.format(ridge1.score(X_test_scaled, y_test)))
print('score with interactions:{:.2}'.format(ridge2.score(X_test_poly, y_test)))
'''
score without interactions:0.62
score with interactions:0.75
'''
📣
Clearly the interaction and polynomial features give a considerable boost to Ridge's performance; with a more complex model such as a random forest, however, the picture is different
3.3 Random forest
from sklearn.ensemble import RandomForestRegressor
rf1 = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
rf2 = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print('score without interactions:{:.3}'.format(rf1.score(X_test_scaled, y_test)))
print('score with interactions:{:.3}'.format(rf2.score(X_test_poly, y_test)))
'''
score without interactions:0.806
score with interactions:0.766
'''
📣
With a random forest, performance is better than Ridge even without the extra features, and adding the interactions and polynomials actually decreases performance slightly
4. References
Introduction to Machine Learning with Python (《Python机器学习基础教程》)