3(3). Feature Selection --- Embedded Methods (Feature Importance Evaluation)
I. Regularization
1.L1/Lasso
The L1 penalty produces sparse solutions, so it naturally performs feature selection. Note, however, that a feature L1 did not select is not necessarily unimportant: when two features are highly correlated, only one of them may be kept. To determine which features actually matter, cross-check the result with L2 regularization.
Example: the snippet below runs Lasso on the Boston housing data; the alpha parameter was tuned via grid search (the tuned value, 0.3, is hard-coded here).
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version

# Small helper assumed by the original snippet: format coefficients with names
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % (i + 1) for i in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)
print("Lasso model:", pretty_print_linear(lasso.coef_, names, sort=True))
As the output shows, many of the coefficients are 0. Increasing alpha further makes the model ever sparser, i.e. more and more coefficients shrink to exactly 0. However, L1-regularized models are just as unstable as unregularized linear models: when the feature set contains correlated features, a small change in the data can produce a very different model.
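As a minimal sketch of that instability (not from the original post), fit Lasso on three noisy copies of one latent variable under a few different seeds and watch the coefficients move:

import numpy as np
from sklearn.linear_model import Lasso

size = 100
for i in range(3):
    rng = np.random.RandomState(i)
    X_seed = rng.normal(0, 1, size)
    # three highly correlated features: noisy copies of the same signal
    X = np.array([X_seed + rng.normal(0, .1, size) for _ in range(3)]).T
    Y = X[:, 0] + X[:, 1] + X[:, 2] + rng.normal(0, 1, size)
    coefs = Lasso(alpha=.3).fit(X, Y).coef_
    print("seed %d:" % i, np.round(coefs, 3))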
2.L2/Ridge
Unlike L1, the L2 penalty spreads weight across correlated features instead of arbitrarily keeping one of them, so the resulting coefficients are far more stable and better reflect each feature's contribution. Example:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

size = 100
# We run the method 10 times with different random seeds
for i in range(10):
    print("Random seed %s" % i)
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    # three features derived from the same latent variable, hence highly correlated
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T

    lr = LinearRegression()
    lr.fit(X, Y)
    print("Linear model:", pretty_print_linear(lr.coef_))  # helper defined above

    ridge = Ridge(alpha=10)
    ridge.fit(X, Y)
    print("Ridge model:", pretty_print_linear(ridge.coef_))
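Across the ten seeds, the unregularized model's coefficients for the three interchangeable features swing widely (some even turning negative), while Ridge distributes the weight roughly evenly among them, close to their shared true contribution. This stability is what makes L2 the natural cross-check for L1 selections mentioned above.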
II. Feature Importance Based on Tree Models
1.RF
2.ExtraTree
3.AdaBoost
4.GBDT
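Headings 1-4 above share a single interface in scikit-learn: once fitted, each model exposes a feature_importances_ attribute (impurity-based for the tree ensembles). A minimal sketch on synthetic data, not from the original post:

from sklearn.datasets import make_regression
from sklearn.ensemble import (AdaBoostRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, RandomForestRegressor)

# Synthetic regression data: 8 features, only 3 of them informative
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       random_state=0)

for Model in (RandomForestRegressor, ExtraTreesRegressor,
              AdaBoostRegressor, GradientBoostingRegressor):
    model = Model(n_estimators=100, random_state=0).fit(X, y)
    print(Model.__name__, model.feature_importances_.round(3))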
5.XGBoost
get_score(fmap='', importance_type='weight')
Here get_score is a method of the native Booster object: fmap is the path to a txt file containing the feature-name mapping, and importance_type specifies how the importance is computed; it takes one of five values:
'weight': the number of times a feature is used to split the data across all trees.
'gain': the average gain across all splits the feature is used in.
'cover': the average coverage across all splits the feature is used in.
'total_gain': the total gain across all splits the feature is used in.
'total_cover': the total coverage across all splits the feature is used in.
[1] importance_type=weight (the default for get_score): the number of times the feature is used as a split attribute across all trees; the more often it appears, the higher its score.
[2] importance_type=gain: the average reduction in loss when the feature is used as a split attribute, i.e. the feature's total gain over all its splits divided by the number of times it is used.
[3] importance_type=cover: the average coverage of the feature's splits, i.e. the sum of the second-order gradients (hessians) of the samples at the feature's split nodes, divided by the number of times the feature is used.
[4] importance_type=total_gain: the same quantity as gain but without averaging over splits (average_over_splits=False); total_gain = gain × weight.
[5] importance_type=total_cover: the same quantity as cover but without averaging over splits; total_cover = cover × weight.
From the constructor below we can see that the xgboost sklearn API defaults to importance_type="gain" when computing feature importance, whereas the native get_score method defaults to importance_type="weight":
def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
             verbosity=1, silent=None, objective="reg:linear", booster='gbtree',
             n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
             max_delta_step=0, subsample=1, colsample_bytree=1,
             colsample_bylevel=1, colsample_bynode=1, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, base_score=0.5,
             random_state=0, seed=None, missing=None,
             importance_type="gain",  # declared here
             **kwargs):
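To make the two defaults concrete, here is a minimal sketch (not from the original post; the synthetic data and model settings are purely illustrative) that prints the sklearn-API importances alongside two native get_score variants:

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 2 * X[:, 0] + X[:, 1] + rng.rand(200)

model = xgb.XGBRegressor(n_estimators=50)
model.fit(X, y)

# sklearn API: importances under the constructor's importance_type
print(model.feature_importances_)

# native API: get_score defaults to 'weight'; other types must be requested
booster = model.get_booster()
print(booster.get_score(importance_type='weight'))
print(booster.get_score(importance_type='total_gain'))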
6.LightGBM
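LightGBM offers the analogous choices: 'split' (the number of times a feature is used to split, like xgboost's weight) and 'gain'. A minimal sketch, assuming the lightgbm package is installed and with illustrative synthetic data:

import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 2 * X[:, 0] + X[:, 1] + rng.rand(200)

model = lgb.LGBMRegressor(n_estimators=50)
model.fit(X, y)

# sklearn wrapper: importance_type defaults to 'split'
print(model.feature_importances_)
# native Booster API: request 'gain' explicitly
print(model.booster_.feature_importance(importance_type='gain'))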
7.Have each of RF, AdaBoost, and ExtraTree select its top-k features, then merge the selections
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

def get_top_k_feature(features, model, top_n_features):
    # Rank features by the best estimator's importances, descending, keep top n
    feature_imp_sorted = pd.DataFrame({
        'feature': features,
        'importance': model.best_estimator_.feature_importances_
    }).sort_values('importance', ascending=False)
    return feature_imp_sorted.head(top_n_features)['feature']

def ensemble_model_feature(X, Y, top_n_features):
    features = list(X)
    # Random forest
    rf = ensemble.RandomForestRegressor()
    rf_param_grid = {'n_estimators': [900], 'random_state': [2, 4, 6, 8]}
    rf_grid = GridSearchCV(rf, rf_param_grid, cv=10, verbose=1, n_jobs=25)
    rf_grid.fit(X, Y)
    top_n_features_rf = get_top_k_feature(features, rf_grid, top_n_features)
    print('RF selection done')
    # AdaBoost (reuses the same parameter grid)
    abr = ensemble.AdaBoostRegressor()
    abr_grid = GridSearchCV(abr, rf_param_grid, cv=10, n_jobs=25)
    abr_grid.fit(X, Y)
    top_n_features_abr = get_top_k_feature(features, abr_grid, top_n_features)
    print('AdaBoost selection done')
    # ExtraTrees
    etr = ensemble.ExtraTreesRegressor()
    etr_grid = GridSearchCV(etr, rf_param_grid, cv=10, n_jobs=25)
    etr_grid.fit(X, Y)
    top_n_features_etr = get_top_k_feature(features, etr_grid, top_n_features)
    print('ExtraTrees selection done')
    # Merge the three top-k lists and drop duplicates (set union)
    features_top_n = pd.concat(
        [top_n_features_rf, top_n_features_abr, top_n_features_etr],
        ignore_index=True).drop_duplicates()
    print(features_top_n)
    print(len(features_top_n))
    return features_top_n
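Hypothetical usage (the variable names are illustrative): X must be a pandas DataFrame, since list(X) is relied on to yield the column names, and Y the matching target:

top_features = ensemble_model_feature(X_train, Y_train, top_n_features=10)
X_train_reduced = X_train[top_features]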