The relationship between FeatureUnion and ColumnTransformer

from __future__ import print_function
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()
X, y = iris.data, iris.target

This dataset is way too high-dimensional. Better do PCA:

pca = PCA(n_components=2)

Maybe some original features were good, too?

selection = SelectKBest(k=1)

Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
ct_f = ColumnTransformer([("pca", pca, [0, 1, 2, 3]),
                          ("univ_select", selection, [0, 1, 2, 3])])

Use combined features to transform dataset:

X_features = combined_features.fit(X, y).transform(X)
print("FeatureUnion: combined space has", X_features.shape[1], "features")

X_features2 = ct_f.fit(X, y).transform(X)
print("ColumnTransformer: combined space has", X_features2.shape[1], "features")

# Compare the first 20 rows of both outputs, plus their difference
for i in range(20):
    print(X_features[i], "--> ", X_features2[i], X_features[i] - X_features2[i])
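
The row-by-row printout can also be condensed into a single check. The following is a minimal sketch rather than part of the original notes: it rebuilds both transformers with fresh estimators (the names `union`, `ct` and `all_columns` are made up for illustration) and compares the two outputs numerically.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)
all_columns = [0, 1, 2, 3]  # every iris column, so both objects see the same input

# FeatureUnion: each transformer receives the full X.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

# ColumnTransformer: each transformer receives the listed columns (here: all of them).
ct = ColumnTransformer([
    ("pca", PCA(n_components=2), all_columns),
    ("univ_select", SelectKBest(k=1), all_columns),
])

X_union = union.fit_transform(X, y)
X_ct = ct.fit_transform(X, y)

# Compare shapes first, then values up to floating-point tolerance.
same_shape = X_union.shape == X_ct.shape
same_values = same_shape and np.allclose(X_union, X_ct)
print("same shape:", same_shape, "| same values:", same_values)
```

If the two outputs agree, that supports the idea that a ColumnTransformer whose entries all list every column behaves like a FeatureUnion for this input.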

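TODO(yu): There are two crossing modes here, plus one fully expanded form:

a. one feature extractor processes all columns; this is repeated for each feature and the results are merged;

b. several features are computed from one column at once, and the results for all columns are then merged;

c. features and columns are fully crossed into a Cartesian product, with one column-transformer tuple per (feature, column) pair.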

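(a) fits the basic design and can use ColumnTransformer, but it causes the FFT to be computed multiple times; (b) lets the feature computations share a single FFT result, but it is tightly coupled and more troublesome to implement; (c) is the least economical approach as far as the FFT is concerned.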

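For convenience I choose (a). Clearly (c) is the most flexible, since any combination of columns and features can be specified, whereas (a) and (b) are flexible along only one dimension.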
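
Mode (c) can be sketched by generating the ColumnTransformer tuples with `itertools.product`. This is only an illustration under assumptions: `StandardScaler` and `MinMaxScaler` are hypothetical stand-ins for the real (FFT-based) feature extractors, and the data is random.

```python
from itertools import product

import numpy as np
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical stand-ins for the real feature extractors.
features = [("std", StandardScaler()), ("minmax", MinMaxScaler())]
columns = [0, 1, 2]

# Mode (c): one (name, transformer, columns) tuple per (feature, column) pair.
# Removing entries from this list restricts a feature to a subset of columns.
transformers = [
    ("{}_col{}".format(name, col), clone(est), [col])
    for (name, est), col in product(features, columns)
]
ct_cross = ColumnTransformer(transformers)

rng = np.random.RandomState(0)
X_demo = rng.rand(10, 3)  # random data, for illustration only
X_cross = ct_cross.fit_transform(X_demo)
print(X_cross.shape)  # len(features) * len(columns) output columns
```

Because every pair is fitted independently, nothing is shared between transformers, which is why this form is the most flexible but also the most wasteful when the features rely on a common intermediate result.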

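FeatureUnion mainly solves the problem of merging several kinds of features, ColumnTransformer mainly solves the problem of specifying columns, and Pipeline mainly solves the problem of chaining steps in the vertical (sequential) direction.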

Combining the three is very useful, but it seems that ColumnTransformer can implement what FeatureUnion does?
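
As a rough sketch of how the three pieces might be combined: a ColumnTransformer picks the columns, a FeatureUnion merges the features, and a Pipeline chains the result into a classifier that GridSearchCV can tune. The step names (`columns`, `union`, `svm`) and the grid values are assumptions chosen for illustration.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# FeatureUnion: merge PCA components with the best univariate feature(s).
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

# ColumnTransformer: decide which columns the union gets to see.
columns = ColumnTransformer([("union", union, [0, 1, 2, 3])])

# Pipeline: chain feature extraction and the classifier sequentially.
pipeline = Pipeline([("columns", columns), ("svm", SVC(kernel="linear"))])

# Nested parameters are addressed as <step>__<name>__<param>.
param_grid = {
    "columns__union__pca__n_components": [1, 2, 3],
    "columns__union__univ_select__k": [1, 2],
    "svm__C": [0.1, 1, 10],
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Nesting in this order keeps each object to its own job: changing the column list does not touch the feature definitions, and vice versa.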

posted @ 2018-11-14 16:29  Lucas_Yu