The relationship between FeatureUnion and ColumnTransformer

from __future__ import print_function
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()
X, y = iris.data, iris.target

This dataset is way too high-dimensional. Better do PCA:

pca = PCA(n_components=2)

Maybe some original features were good, too?

selection = SelectKBest(k=1)

Build estimator from PCA and Univariate selection:

combined_features = FeatureUnion([("pca", pca), ("univ_select", selection)])
ct_f = ColumnTransformer([("pca", pca, [0,1,2,3]), ("univ_select", selection, [0,1,2,3])])

Use combined features to transform dataset:

X_features = combined_features.fit(X, y).transform(X)
print("Combined space has", X_features.shape[1], "features")

X_features2 = ct_f.fit(X, y).transform(X)
print("Combined space has", X_features2.shape[1], "features")

for i in range(20):
    print(X_features[i], "-->", X_features2[i], X_features[i] - X_features2[i])
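
The row-by-row printout above can also be condensed into a single numeric check. The following is a minimal, self-contained sketch, assuming both transformers are given all four iris columns as in the code above; the variable names (`union`, `ct`, `max_abs_diff`) are chosen here for illustration only.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)
all_columns = [0, 1, 2, 3]

# FeatureUnion: every transformer sees the whole matrix.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

# ColumnTransformer: every transformer is told explicitly to use all columns.
ct = ColumnTransformer([
    ("pca", PCA(n_components=2), all_columns),
    ("univ_select", SelectKBest(k=1), all_columns),
])

X_union = union.fit(X, y).transform(X)
X_ct = ct.fit(X, y).transform(X)

# Largest element-wise difference between the two outputs.
max_abs_diff = np.abs(X_union - X_ct).max()
print("shapes:", X_union.shape, X_ct.shape)
print("max abs difference:", max_abs_diff)
```

If the difference is numerically zero, the two constructions are interchangeable for this particular setup, where every transformer receives every column.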

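TODO(yu): There are two ways to cross features with columns here, plus a fully expanded form:

a. One feature extractor processes all columns; this is repeated for each feature and the results are merged.

b. Several features are computed from one column at the same time, and the per-column results are then merged.

c. Features and columns are fully crossed into a Cartesian product, with one column-transformer tuple per (feature, column) pair.

Option a fits the basic design and can use ColumnTransformer directly, but it causes repeated computation, presumably of the FFT. Option b lets the feature computations share a single FFT result, but the coupling is tight and it is more awkward to implement. Option c is the least economical with respect to the FFT.

For convenience, choose a. Option c is clearly the most flexible, since any column can be paired with any feature, whereas a and b are flexible along only one dimension.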
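
Option c, the full Cartesian product, could be sketched roughly as follows. This is only an illustration: the two scalers stand in for the FFT-based features mentioned above, and the names `features`, `columns`, and `transformers` are hypothetical.

```python
from itertools import product

from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, y = load_iris(return_X_y=True)

# Hypothetical stand-ins for the real per-column feature extractors.
features = [("std", StandardScaler()), ("minmax", MinMaxScaler())]
columns = [0, 2]

# One (name, transformer, columns) tuple per (feature, column) pair.
transformers = [
    ("%s_col%d" % (name, col), clone(est), [col])
    for (name, est), col in product(features, columns)
]

ct_product = ColumnTransformer(transformers)
X_product = ct_product.fit_transform(X)
print(len(transformers), "tuples ->", X_product.shape[1], "output columns")
```

Because each tuple is independent, any subset of (feature, column) pairs can be kept or dropped, which is the flexibility option c offers; the cost is that nothing is shared between tuples.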

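FeatureUnion mainly solves the problem of merging several kinds of features, ColumnTransformer mainly solves the problem of specifying which columns to use, and Pipeline mainly solves the problem of chaining steps one after another (the "vertical" direction).

Combining the three is very useful. But it seems that ColumnTransformer can also do what FeatureUnion does?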
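
As a rough sketch of how the three might be combined, the example below nests a FeatureUnion inside a ColumnTransformer and chains the result to a classifier with a Pipeline. The column split and the step names are assumptions made for illustration, not a recommended design.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# FeatureUnion merges several kinds of features computed from the same input.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

# ColumnTransformer decides which columns each branch receives.
column_features = ColumnTransformer([
    ("union_all", union, [0, 1, 2, 3]),  # 2 PCA components + 1 selected column
    ("raw_petal", "passthrough", [2, 3]),  # keep two columns unchanged
])

# Pipeline chains feature construction and the final estimator.
pipe = Pipeline([
    ("features", column_features),
    ("svm", SVC(kernel="linear")),
])
pipe.fit(X, y)

n_features = pipe.named_steps["features"].transform(X).shape[1]
train_accuracy = pipe.score(X, y)
print("features:", n_features, "training accuracy:", train_accuracy)
```

When every branch uses all columns, the ColumnTransformer alone behaves like the FeatureUnion; the nesting only becomes useful once different branches need different column subsets.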

posted @ 2018-11-14 16:29 Lucas_Yu

