sklearn.ensemble.VotingClassifier


Chinese documentation: Voting Classifier


1. How the Voting Classifier Works

A Voting Classifier is a voting-based ensemble for classification. It has two modes, hard voting and soft voting:

voting = 'hard': hard voting. The predicted class label is decided by a majority vote over the base models' predicted labels.

voting = 'soft': soft voting. The predicted class label is the one with the largest sum of predicted probabilities across the base models.


Hard voting:

Each base model casts one vote for the class label it predicts, and the class with the most votes wins. For example, "model 1 (A: 99%, B: 1%)" means model 1 predicts class A with probability 99% and class B with probability 1%, so model 1 votes for A.

Soft voting:

Each base model contributes its full predicted probability distribution, and the class whose probability sum is largest wins.

Drawback of hard voting: the final prediction is not necessarily the class with the largest probability; a class predicted with low confidence by many models can outvote a class predicted with high confidence by few. Suppose, for illustration, that models 2 and 3 each predict (A: 49%, B: 51%). Hard voting then returns B (two votes to one), while soft voting returns A, since A's summed probability (0.99 + 0.49 + 0.49 = 1.97) exceeds B's (1.03).
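A minimal NumPy sketch of that disagreement (the probabilities for models 2 and 3 are illustrative assumptions, not values from the source):

import numpy as np

# Predicted probabilities [P(A), P(B)] for the three hypothetical models above.
probas = np.array([[0.99, 0.01],   # model 1
                   [0.49, 0.51],   # model 2
                   [0.49, 0.51]])  # model 3
labels = np.array(['A', 'B'])

# Hard voting: each model votes for its argmax label; the majority wins.
votes = probas.argmax(axis=1)               # [0, 1, 1]
print(labels[np.bincount(votes).argmax()])  # B (two votes to one)

# Soft voting: sum the probability columns; the largest sum wins.
print(labels[probas.sum(axis=0).argmax()])  # A (1.97 vs 1.03)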



2. Voting Classifier Examples

class sklearn.ensemble.VotingClassifier(estimators, *, voting='hard', weights=None, n_jobs=None, flatten_transform=True, verbose=False)

Parameters:

  • estimators: a list of (name, estimator) tuples giving the base models to combine, as built in the examples below.
  • voting: the voting method. 'hard' decides the predicted class label by majority vote over the base models' predicted labels; 'soft' predicts the class with the largest sum of predicted probabilities.
  • weights: per-classifier voting weights. With soft voting, the class probabilities are combined as a weighted sum (see the sketch after this list).
  • n_jobs: defaults to None (a single CPU). Set to -1 to use all CPUs when fitting the base estimators in parallel.
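As a sketch of what the soft-voting rule computes with weights (the probability matrices below are made-up examples), the weighted average of the models' predict_proba outputs is taken before the per-sample argmax:

import numpy as np

# Hypothetical predict_proba outputs of three models: 2 samples x 3 classes.
p1 = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p2 = np.array([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]])
p3 = np.array([[0.4, 0.4, 0.2], [0.2, 0.3, 0.5]])

# weights=[2, 1, 1]: the first model counts twice as much in the weighted sum.
avg = np.average([p1, p2, p3], axis=0, weights=[2, 1, 1])
print(avg.argmax(axis=1))  # predicted class index per sample -> [0 1]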

2.1 Hard Voting

Hard voting: all base models count equally (no relative importance); the class receiving the most votes is the final prediction.

from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier


# Load the iris data (150 samples, 3 classes) and hold out 30% for testing.
iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Iris has 3 classes, so use a multiclass objective rather than 'binary:logistic'.
clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2,
                     subsample=0.8, colsample_bytree=0.8, objective='multi:softprob')
# min_samples_leaf=63 is very restrictive for 150 samples, so this forest
# underfits badly (which explains the 0.33 accuracy below).
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63, oob_score=True)
clf3 = SVC(C=0.1, probability=True)

# Hard voting: majority vote over the three models' predicted labels.
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)],
                        voting='hard')

# No explicit fit is needed here: cross_val_score clones and refits each model internally.
for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBoost', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBoost]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [SVM]
Accuracy: 0.93 (+/- 0.03) [Ensemble]
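The train/test split above is created but never used, because cross_val_score does its own splitting. A short continuation of the session above evaluates the fitted ensemble on the held-out 30%:

# Fit the voting ensemble itself and score it on the held-out split.
eclf.fit(x_train, y_train)
print("Hold-out accuracy: %.2f" % eclf.score(x_test, y_test))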

2.2 Soft Voting

Soft voting: models can be given weights that reflect their relative importance.

from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier


iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Same base models as in 2.1 (multiclass objective for the 3-class iris data).
clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3, min_child_weight=2,
                     subsample=0.8, colsample_bytree=0.8, objective='multi:softprob')
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63, oob_score=True)
clf3 = SVC(C=0.1, probability=True)

# Soft voting: argmax of the weighted sum of predicted probabilities,
# with XGBoost weighted twice as heavily as the other two models.
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)],
                        voting='soft',
                        weights=[2, 1, 1])

for clf, label in zip([clf1, clf2, clf3, eclf], ['XGBoost', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBoost]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [SVM]
Accuracy: 0.96 (+/- 0.02) [Ensemble]
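With voting='soft' the ensemble itself exposes predict_proba (with voting='hard' that attribute raises an AttributeError). Continuing the session above:

# The weighted, averaged class probabilities of the soft-voting ensemble.
eclf.fit(x_train, y_train)
print(eclf.predict_proba(x_test[:3]))  # one row of class probabilities per sample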

2.3 Combining with GridSearchCV

See also: GridSearchCV explained

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
                        voting='soft',
                        weights=[2, 1, 1])

# Nested parameters use the '<estimator name>__<parameter>' convention.
params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X, y)
print(grid.best_params_)        # best parameter combination found by the cross-validated search
print(grid.best_estimator_)     # best model, refit on the full data with those parameters
{'lr__C': 1.0, 'rf__n_estimators': 20}
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=1)),
                             ('rf',
                              RandomForestClassifier(n_estimators=20,
                                                     random_state=1)),
                             ('gnb', GaussianNB())],
                 voting='soft', weights=[2, 1, 1])
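Because weights is a constructor parameter of the VotingClassifier itself, the grid can also search candidate weight vectors directly, with no double-underscore prefix. A sketch extending the example above:

# Candidate weight vectors are searched like any other VotingClassifier parameter.
params = {'lr__C': [1.0, 100.0],
          'rf__n_estimators': [20, 200],
          'weights': [[1, 1, 1], [2, 1, 1], [1, 2, 1]]}

grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5).fit(X, y)
print(grid.best_params_)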

