Boruta Feature Selection

Official GitHub repository: https://github.com/scikit-learn-contrib/boruta_py?tab=readme-ov-file

Paper: https://www.jstatsoft.org/article/view/v036i11

Official example code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# load X and y
# NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
X = pd.read_csv('examples/test_X.csv', index_col=0).values
y = pd.read_csv('examples/test_y.csv', header=None, index_col=0).values
y = y.ravel()

# define random forest classifier, with utilising all cores and
# sampling in proportion to y labels
rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

# define Boruta feature selection method
feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

# find all relevant features - 5 features should be selected
feat_selector.fit(X, y)

# check selected features - first 5 features are selected
feat_selector.support_

# check ranking of features
feat_selector.ranking_

# call transform() on X to filter it down to selected features
X_filtered = feat_selector.transform(X)

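Running this locally failed with: AttributeError: module 'numpy' has no attribute 'int'. np.int was a deprecated alias for the builtin int. The np.int alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, so recent versions no longer provide it. I tried downgrading NumPy, but the install failed with a wheel error. Many people have hit the same problem according to the GitHub issues, and the workaround is to add three lines before calling boruta = BorutaPy(estimator=rf):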

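import numpy as np

# Restore the aliases that BorutaPy still references but newer NumPy no longer provides
np.int = np.int32
np.float = np.float64
np.bool = np.bool_

boruta = BorutaPy(estimator=rf)
boruta.fit(X, y)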
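The three assignments above overwrite the aliases unconditionally. A slightly more defensive variant only adds an alias when it is missing. The sketch below is an illustration, not part of BorutaPy or NumPy: the helper name add_missing_aliases is hypothetical, and a SimpleNamespace stands in for the numpy module so the example runs without any third-party package.

```python
from types import SimpleNamespace


def add_missing_aliases(module, aliases):
    """Set each alias on `module` only if it is not already defined.

    Returns the list of alias names that were actually added.
    """
    added = []
    for name, target in aliases.items():
        if not hasattr(module, name):
            setattr(module, name, target)
            added.append(name)
    return added


# Stand-in for a module that lacks `int` and `float` but still has `bool`
fake_np = SimpleNamespace(bool=bool)
added = add_missing_aliases(fake_np, {"int": int, "float": float, "bool": bool})
print(added)
```

With the real library, the same helper could be called as add_missing_aliases(np, {"int": np.int32, "float": np.float64, "bool": np.bool_}) before constructing BorutaPy, assuming numpy has been imported as np.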

Below is the code after my changes, adapted to my own needs:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
import numpy as np

file_names_to_add = ['xxx', 'xxxx']
file_path2 = '../xxxx'

for file_name in file_names_to_add:
    input_file_path = f"{file_path2}{file_name}.xlsx"
    print(input_file_path) 

    sheet_name_nor = 'xxx'

    y_tos = ['xxx', '...']

    for y_to in y_tos:
        sheet_name_uni = y_to
        print(sheet_name_uni)

        df = pd.read_excel(input_file_path, sheet_name=sheet_name_nor)

        cols_to_pre = ['xxxxxxx', 'xxxxxx','...']

        missing_cols = [col for col in cols_to_pre if col not in df.columns]
        if missing_cols:
            print(f"{missing_cols} not found in the sheet, skipping.")
            cols_to_pre = [col for col in cols_to_pre if col in df.columns]

        # load X and y
        # NOTE BorutaPy accepts numpy arrays only, hence the .values attribute
        X = df[cols_to_pre].values
        y = df[y_to].values

        np.int = np.int32
        np.float = np.float64
        np.bool = np.bool_

        # define random forest classifier, with utilising all cores and
        # sampling in proportion to y labels
        rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)

        # define Boruta feature selection method
        feat_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=1)

        # find all relevant features
        feat_selector.fit(X, y)

        # # check selected features - first 5 features are selected
        # feat_selector.support_

        # # check ranking of features
        # feat_selector.ranking_

        # call transform() on X to filter it down to selected features
        # X_filtered = feat_selector.transform(X)
        selected_features = [cols_to_pre[i] for i, support in enumerate(feat_selector.support_) if support]

        print('Selected features: ', selected_features)
        print('Feature ranking: ', feat_selector.ranking_)
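The ranking printed at the end of the script is a bare array of integers, which is hard to read without the column names. In BorutaPy, confirmed features get rank 1 and tentative features get rank 2, with higher ranks for rejected features. A minimal sketch of pairing names with ranks follows. The feature names and rank values are made up for illustration and stand in for cols_to_pre and feat_selector.ranking_.

```python
# Hypothetical column names and ranks; in the script above these would be
# cols_to_pre and feat_selector.ranking_
feature_names = ["f1", "f2", "f3", "f4"]
ranking = [1, 3, 1, 2]

# Sort (name, rank) pairs by rank; sorted() is stable, so ties keep column order
ranked = sorted(zip(feature_names, ranking), key=lambda pair: pair[1])

for name, rank in ranked:
    print(f"{rank}\t{name}")
```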

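Because feat_selector.support_ returns a boolean array, printing it directly shows only True/False values rather than the names of the selected features. Boolean indexing solves this: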

selected_features = [cols_to_pre[i] for i, support in enumerate(feat_selector.support_) if support]

The code above iterates over the cols_to_pre list and keeps only the columns whose entry in feat_selector.support_ is True.
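The same mask-based selection can be written in a couple of equivalent ways. The sketch below uses made-up column names and a plain list of booleans as a stand-in for feat_selector.support_, so it runs without a fitted selector.

```python
from itertools import compress

# Hypothetical column names and mask; in the script above these would be
# cols_to_pre and feat_selector.support_
feature_names = ["age", "height", "weight", "noise"]
support_mask = [True, False, True, False]

# Pair each name with its mask entry and keep the names marked True
selected_zip = [name for name, keep in zip(feature_names, support_mask) if keep]

# itertools.compress does the same filtering in a single call
selected_compress = list(compress(feature_names, support_mask))

print(selected_zip)
```

With NumPy available, np.array(cols_to_pre)[feat_selector.support_] gives the same result as an array.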

Posted 2024-03-25 22:19 by ben犇