Feature selection

# -*- coding: utf-8 -*-
"""
Created on Wed Aug 10 20:26:15 2016

@author: qqhfeng
"""

# Module 1: feature selection with VarianceThreshold
'''
Feature selector that removes all low-variance features. 
This feature selection algorithm looks only at the features (X), 
not the desired outputs (y), and can thus be used for unsupervised learning.

VarianceThreshold is a simple baseline approach to feature selection. 
It removes all features whose variance doesn’t meet some threshold.
By default, it removes all zero-variance features, i.e. 
features that have the same value in all samples. 
As an example, suppose that we have a dataset with boolean features, 
and we want to remove all features that are either one or zero (on or off) 
in more than 80% of the samples. Boolean features are Bernoulli random variables,
and the variance of such variables is given by Var[X] = p(1 - p),
so the threshold for this example is .8 * (1 - .8).
'''

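from sklearn.feature_selection import VarianceThreshold

X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
# To drop features that take the same value in more than 80% of samples, use:
# sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
# The default threshold (0.0) removes only zero-variance features.
sel = VarianceThreshold()
print(sel.fit_transform(X))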
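To see why the 80% threshold would drop the first column of X, the per-column variances can be computed by hand. The sketch below uses only the standard library and is not scikit-learn's implementation; the names `variances`, `keep` and `X_reduced` are chosen here for illustration, and it assumes a feature is kept only when its population variance is strictly greater than the threshold.

```python
from statistics import pvariance

# Same toy data as in the script: 6 samples, 3 boolean features.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

# Bernoulli variance p(1 - p) evaluated at p = 0.8.
threshold = .8 * (1 - .8)

# Transpose rows into columns, then take the population variance of each column.
columns = list(zip(*X))
variances = [pvariance(col) for col in columns]

# Keep a feature only if its variance exceeds the threshold (assumption stated above).
keep = [v > threshold for v in variances]
X_reduced = [[value for value, k in zip(row, keep) if k] for row in X]

print(variances)
print(keep)
print(X_reduced)
```

The first column is 0 in five of the six samples, so its variance (5/36) falls below 0.16 and it is the only one dropped.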




# Module 2: select the most important features. SelectKBest removes all but the k highest-scoring features
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)  # chi2 is a scoring function for feature importance
print(X_new.shape)
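The shapes alone do not show which two features survived. A possible way to inspect that, assuming a reasonably recent scikit-learn, is to fit the selector first and read its `scores_` attribute and `get_support()` mask. The variable name `selector` is introduced here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

# Fit first so the per-feature chi2 scores can be inspected before transforming.
selector = SelectKBest(chi2, k=2).fit(X, y)

# Print each feature name with its chi2 score and whether it was kept.
for name, score, kept in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(name, round(float(score), 2), kept)

X_new = selector.transform(X)
print(X_new.shape)
```

Note that chi2 requires non-negative feature values, which holds for the iris measurements.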



# Module 3: recursive feature elimination (RFE)
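The original post gives no code for this module. As a minimal sketch only, recursive feature elimination could be run on the same iris data with scikit-learn's `RFE`. The choice of `LogisticRegression` as the estimator, `max_iter=1000`, and keeping 2 features are assumptions made here for illustration, not the author's settings.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target

# Assumed estimator: any model exposing coef_ or feature_importances_ would work.
estimator = LogisticRegression(max_iter=1000)

# Repeatedly fit the estimator and drop the weakest feature (step=1) until 2 remain.
rfe = RFE(estimator, n_features_to_select=2, step=1)
X_rfe = rfe.fit_transform(X, y)

print(X_rfe.shape)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature; higher ranks were eliminated earlier
```

Unlike SelectKBest, which scores each feature independently, RFE relies on a fitted model, so the result depends on the estimator chosen.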
posted on 2016-08-10 20:44 by qqhfeng16