Data Representation and Feature Engineering -- Interaction Features and Polynomial Features
⭐ One way to enrich a feature representation, especially for linear models, is to add interaction features and polynomial features derived from the original data.
To add slopes to the linear model on the binned data (from Section 4.2), there are two options: 1. add back the original feature (the x-axis in the figure); 2. add interaction or product features.
1. Adding back the original feature
# Add the original feature back to the binned data
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

X, y = mglearn.datasets.make_wave(n_samples=100)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
# Bin the data into 10 bins
bins = np.linspace(-3, 3, 11)
which_bin = np.digitize(X, bins)
# One-hot-encode the bin membership
# (the parameter is sparse_output=False on scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse=False).fit(which_bin)
X_binned = encoder.transform(which_bin)
line_binned = encoder.transform(np.digitize(line, bins))
# Stack the 10 bin indicators and the raw feature: 11 columns in total
X_combined = np.hstack([X_binned, X])
line_combined = np.hstack([line_binned, line])
X_combined.shape
'''
(100, 11)
'''
reg = LinearRegression().fit(X_combined, y)
plt.plot(line, reg.predict(line_combined), label='LinearRegression')
plt.plot(X, y, 'o', c='k')
plt.vlines(bins, -3, 3, alpha=.2)
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
X_combined
'''
array([[ 0. , 0. , 0. , ..., 0. ,
0. , -0.75275929],
[ 0. , 0. , 0. , ..., 0. ,
1. , 2.70428584],
[ 0. , 0. , 0. , ..., 0. ,
0. , 1.39196365],
...,
[ 0. , 0. , 0. , ..., 0. ,
0. , -0.43475389],
[ 1. , 0. , 0. , ..., 0. ,
0. , -2.84748524],
[ 0. , 1. , 0. , ..., 0. ,
0. , -2.35265144]])
'''
📣
In this example, the model learned an offset for each bin together with a slope, but the slope turned out to be identical in every bin (see the quick check below)
- this is because the raw x column enters the model only once, so it contributes a single slope coefficient that is shared across all bins
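A quick way to verify this (my own check, not from the book) is to inspect the coefficients of the reg model just fitted; the raw x is the last column of X_combined, so it gets exactly one coefficient:
# The design matrix has 11 columns: 10 bin indicators plus the raw x feature.
# The single coefficient on the raw x column is the slope shared by all bins.
print('shared slope:', reg.coef_[-1])
# Each bin's offset is its indicator coefficient plus the intercept.
print('per-bin offsets:', reg.coef_[:-1] + reg.intercept_)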
⭐ What we would rather have is a different slope in each bin, which we can get by adding interaction or product features
2. Adding interaction or product features
2.1 Adding interaction features
# Add products of the raw feature and the bin indicators
X_product = np.hstack([X_binned, X * X_binned])
X_product.shape
# Output: (100, 20)
# The dataset now has 20 features: the indicator of the bin each data point
# falls into, plus the products of the raw feature and the bin indicators.
# Each product feature can be seen as a separate copy of the x-axis feature
# that is active only inside its bin.
line_product = np.hstack([line_binned, line * line_binned])
reg = LinearRegression().fit(X_product, y)
plt.plot(line, reg.predict(line_product), label='LinearRegression')
plt.plot(X, y, 'o', c='k')
plt.vlines(bins, -3, 3, alpha=.2)
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
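As a complementary check (again not from the book), the last 10 columns of X_product are x multiplied by each bin indicator, so their coefficients are the slopes learned for each bin, and they now differ from bin to bin:
# Columns 10-19 are x * (bin indicator); their coefficients are the
# per-bin slopes, one for each of the 10 bins.
print('per-bin slopes:', reg.coef_[10:])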
2.2 Using polynomials of the original features
⭐ Binning is one way to expand a continuous feature; another is to use polynomials of the original features
This is implemented in sklearn.preprocessing.PolynomialFeatures
from sklearn.preprocessing import PolynomialFeatures
# Include polynomials up to x**10; the default include_bias=True would
# add a constant feature of ones, so we turn it off
poly = PolynomialFeatures(degree=10, include_bias=False)
poly.fit(X)
X_poly = poly.transform(X)
X_poly.shape
# (100, 10)
print(poly.get_feature_names())  # get_feature_names_out() on scikit-learn >= 1.0
'''
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']
'''
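As a quick sanity check (not from the book), each row of X_poly simply holds the successive powers of the corresponding entry of X:
# Each row of X_poly contains x, x**2, ..., x**10 for one sample
print('X[0]:', X[0])
print('X_poly[0]:', X_poly[0])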
# Using polynomial features together with a linear regression model gives the classical model of polynomial regression (a smooth curve)
reg = LinearRegression().fit(X_poly, y)
line_poly = poly.transform(line)
plt.plot(line, reg.predict(line_poly), label='poly regression')
plt.plot(X, y, 'o', c='k')
plt.vlines(bins, -3, 3, alpha=.2)
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
📣
Polynomial features give a very smooth fit on this one-dimensional data, but high-degree polynomials can behave in extreme ways on the boundaries or in regions with little data (see the check below)
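One way to see this (a quick check, assuming the reg and poly fitted above) is to evaluate the degree-10 fit just outside the training range, where there is no data to constrain it:
# Points just outside the [-3, 3) range used for training
x_outside = np.array([[-3.5], [3.5]])
# These predictions will typically lie far outside the range of y
print(reg.predict(poly.transform(x_outside)))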
⭐ For comparison, here is a kernel SVM learned on the original data, without any transformation
from sklearn.svm import SVR

for gamma in [1, 10]:
    svr = SVR(gamma=gamma).fit(X, y)
    plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
plt.plot(X, y, 'o', c='k')
plt.xlabel('feature')
plt.ylabel('output')
plt.legend()
📣
Note that with a more complex model such as the kernel SVM, we get a prediction similar to the polynomial regression without any explicit transformation of the features
3. Application to the Boston Housing dataset
⭐ Apply polynomial and interaction features to the Boston Housing dataset
3.1 Preparing the dataset
# First load the dataset and rescale it
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
# Rescale each feature to [0, 1]
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Extract polynomial and interaction features up to degree 2
poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
X_train_poly = poly.transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)
# Inspect X_train_poly
print('original shape:{}'.format(X_train.shape))
print('after poly shape:{}'.format(X_train_poly.shape))
print('X_train_poly feature_names:{}'.format(poly.get_feature_names()))
'''
original shape:(379, 13)
after poly shape:(379, 105)
X_train_poly feature_names:['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2', 'x0 x3', 'x0 x4', 'x0 x5', 'x0 x6', 'x0 x7', 'x0 x8', 'x0 x9', 'x0 x10', 'x0 x11', 'x0 x12', 'x1^2', 'x1 x2', 'x1 x3', 'x1 x4', 'x1 x5', 'x1 x6', 'x1 x7', 'x1 x8', 'x1 x9', 'x1 x10', 'x1 x11', 'x1 x12', 'x2^2', 'x2 x3', 'x2 x4', 'x2 x5', 'x2 x6', 'x2 x7', 'x2 x8', 'x2 x9', 'x2 x10', 'x2 x11', 'x2 x12', 'x3^2', 'x3 x4', 'x3 x5', 'x3 x6', 'x3 x7', 'x3 x8', 'x3 x9', 'x3 x10', 'x3 x11', 'x3 x12', 'x4^2', 'x4 x5', 'x4 x6', 'x4 x7', 'x4 x8', 'x4 x9', 'x4 x10', 'x4 x11', 'x4 x12', 'x5^2', 'x5 x6', 'x5 x7', 'x5 x8', 'x5 x9', 'x5 x10', 'x5 x11', 'x5 x12', 'x6^2', 'x6 x7', 'x6 x8', 'x6 x9', 'x6 x10', 'x6 x11', 'x6 x12', 'x7^2', 'x7 x8', 'x7 x9', 'x7 x10', 'x7 x11', 'x7 x12', 'x8^2', 'x8 x9', 'x8 x10', 'x8 x11', 'x8 x12', 'x9^2', 'x9 x10', 'x9 x11', 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']
'''
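Note that load_boston was removed in scikit-learn 1.2. On newer versions, a common workaround (a sketch, assuming the CMU StatLib mirror is still reachable) is to rebuild the same arrays from the original source:
import pandas as pd
# The raw file spreads each record over two lines
data_url = 'http://lib.stat.cmu.edu/datasets/boston'
raw_df = pd.read_csv(data_url, sep=r'\s+', skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]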
📣
The original data has only 13 features, which PolynomialFeatures expanded to 105
- the new features include the constant feature '1', the original features, their squares, and all pairwise interactions between two different original features
- degree=2 means we take all features that are products of up to two original features (the short calculation below confirms the count)
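The 105 can be verified with a little combinatorics (a quick calculation, not from the book):
from math import comb
# 1 bias + 13 linear + 13 squares + C(13, 2) = 78 pairwise products
print(1 + 13 + 13 + comb(13, 2))
# Output: 105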
3.2 Building a Ridge regression model
Use Ridge regression to compare performance with and without the polynomial features
from sklearn.linear_model import Ridge
ridge1 = Ridge().fit(X_train_scaled, y_train)
ridge2 = Ridge().fit(X_train_poly, y_train)
print('score without interactions:{:.2}'.format(ridge1.score(X_test_scaled, y_test)))
print('score with interactions:{:.2}'.format(ridge2.score(X_test_poly, y_test)))
'''
score without interactions:0.62
score with interactions:0.75
'''
📣
Clearly the interaction and polynomial features give a considerable boost to Ridge's performance; with a more complex model such as a random forest, however, the picture is different
3.3 Random forest
from sklearn.ensemble import RandomForestRegressor
rf1 = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
rf2 = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
print('score without interactions:{:.3}'.format(rf1.score(X_test_scaled, y_test)))
print('score with interactions:{:.3}'.format(rf2.score(X_test_poly, y_test)))
'''
score without interactions:0.806
score with interactions:0.766
'''
📣
With a random forest, performance is better than Ridge even without the extra features, and adding the interactions and polynomials actually decreases performance slightly
4. References
Introduction to Machine Learning with Python (《Python机器学习基础教程》)