sklearn.ensemble.VotingClassifier
1. How the Voting Classifier Works
A Voting Classifier is a voting mechanism for classification. It comes in two forms: hard voting and soft voting.
voting='hard': hard voting. The predicted class label is decided by a majority vote over the labels predicted by the individual classifiers.
voting='soft': soft voting. The predicted class label is the argmax of the sum of the classifiers' predicted probabilities.
Consider an example: suppose three models classify a sample into class A or B (the probabilities for models 2 and 3 below are illustrative):
Model 1: A-99%, B-1%, i.e. model 1 predicts class A with probability 99% and class B with probability 1%.
Model 2: A-49%, B-51%.
Model 3: A-49%, B-51%.
Hard voting: each model casts one vote for its most probable label, so B wins 2 votes to 1.
Soft voting: the predicted probabilities are averaged per class; A scores (99% + 49% + 49%) / 3 ≈ 65.7% against B's ≈ 34.3%, so A wins.
Disadvantage of hard voting: the final result is not necessarily the label with the largest overall probability; a label predicted with low confidence by many models can outvote a label predicted with high confidence by a few.
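A minimal sketch of the two aggregation rules on the illustrative probabilities above, in plain numpy (no sklearn involved):

import numpy as np

# per-model predicted probabilities for classes [A, B] (illustrative numbers)
probas = np.array([[0.99, 0.01],   # model 1
                   [0.49, 0.51],   # model 2
                   [0.49, 0.51]])  # model 3
labels = np.array(['A', 'B'])

# hard voting: each model votes for its argmax label; the majority wins
votes = probas.argmax(axis=1)               # -> [0, 1, 1]
hard = labels[np.bincount(votes).argmax()]  # -> 'B'

# soft voting: argmax of the summed probabilities per class
soft = labels[probas.sum(axis=0).argmax()]  # sums to [1.97, 1.03] -> 'A'

print("hard voting ->", hard, "| soft voting ->", soft)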
2. Voting Classifier Examples
class sklearn.ensemble.VotingClassifier(estimators, *, voting='hard', weights=None, n_jobs=None, flatten_transform=True, verbose=False)
Parameters:
estimators: the list of (name, estimator) tuples to combine, e.g. the estimators lists built in the examples below.
voting: the voting method. 'hard' decides the predicted class label by majority vote; 'soft' takes the argmax of the sum of the predicted probabilities.
weights: per-classifier voting weights, used to form a weighted sum of the class probabilities (see the sketch after this list).
n_jobs: defaults to None, meaning the base estimators are fit in a single process; set it to -1 to use all CPUs.
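How weights enters the soft-vote rule can be seen with a few lines of numpy, reusing the illustrative probabilities from section 1: the class scores become a weighted average of the per-model probabilities.

import numpy as np

# per-model probabilities for classes [A, B] (illustrative numbers from above)
probas = np.array([[0.99, 0.01],
                   [0.49, 0.51],
                   [0.49, 0.51]])

w = [2, 1, 1]                                # give the first model double weight
avg = np.average(probas, axis=0, weights=w)  # weighted average -> [0.74, 0.26]
print("weighted soft vote -> class", avg.argmax())  # class 0 (A) wins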
2.1 Hard Voting
Hard voting does not weight the models by relative importance; the class with the most votes becomes the final prediction.
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# iris has 3 classes, so use a multiclass objective rather than 'binary:logistic'
clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3,
                     min_child_weight=2, subsample=0.8, colsample_bytree=0.8,
                     objective='multi:softprob')
# min_samples_leaf=63 is extreme for 150 samples, which is why this forest
# only reaches 0.33 accuracy below
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63, oob_score=True)
clf3 = SVC(C=0.1, probability=True)

# hard voting: majority vote over the predicted class labels
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)],
                        voting='hard')

# cross_val_score clones and refits each estimator, so no separate fit() is needed here
for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [SVM]
Accuracy: 0.93 (+/- 0.03) [Ensemble]
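The train/test split above is not actually consumed by cross_val_score, which refits clones internally. As a minimal usage sketch, the ensemble can also be fit once on the training split and scored on the held-out test set (the exact number depends on the random split):

# fit the hard-voting ensemble and evaluate it on the held-out test set
eclf.fit(x_train, y_train)
y_pred = eclf.predict(x_test)  # majority-vote class labels
print("Test accuracy: %.2f" % eclf.score(x_test, y_test))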
2.2 Soft Voting
Soft voting can assign each model a weight, distinguishing the models' relative importance.
from sklearn import datasets
from sklearn.model_selection import cross_val_score, train_test_split
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

iris = datasets.load_iris()
x = iris.data
y = iris.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

clf1 = XGBClassifier(learning_rate=0.1, n_estimators=150, max_depth=3,
                     min_child_weight=2, subsample=0.8, colsample_bytree=0.8,
                     objective='multi:softprob')  # iris has 3 classes
clf2 = RandomForestClassifier(n_estimators=50, max_depth=1, min_samples_split=4,
                              min_samples_leaf=63, oob_score=True)
clf3 = SVC(C=0.1, probability=True)  # probability=True is required for soft voting

# soft voting: argmax of the weighted sum of predicted probabilities,
# with the XGBoost model weighted twice as heavily as the other two
eclf = VotingClassifier(estimators=[('xgb', clf1), ('rf', clf2), ('svc', clf3)],
                        voting='soft',
                        weights=[2, 1, 1])

for clf, label in zip([clf1, clf2, clf3, eclf],
                      ['XGBBoosting', 'Random Forest', 'SVM', 'Ensemble']):
    scores = cross_val_score(clf, x, y, cv=5, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.96 (+/- 0.02) [XGBBoosting]
Accuracy: 0.33 (+/- 0.00) [Random Forest]
Accuracy: 0.92 (+/- 0.03) [SVM]
Accuracy: 0.96 (+/- 0.02) [Ensemble]
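Because voting='soft' combines probabilities, the fitted ensemble also exposes predict_proba, which returns the weighted average of the base models' class probabilities:

# inspect the combined class probabilities of the soft-voting ensemble
eclf.fit(x_train, y_train)
proba = eclf.predict_proba(x_test)  # shape (n_samples, 3): weighted averages
print(proba[:3])                    # probabilities for the first 3 test samples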
2.3 Combining with GridSearchCV
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)],
voting='soft',
weights=[2, 1, 1])
# nested estimator parameters are addressed as '<estimator name>__<parameter>'
params = {'lr__C': [1.0, 100.0], 'rf__n_estimators': [20, 200]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X, y)
print(grid.best_params_)     # best parameter combination found by the search
print(grid.best_estimator_)  # the ensemble refit with those parameters
{'lr__C': 1.0, 'rf__n_estimators': 20}
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=1)),
('rf',
RandomForestClassifier(n_estimators=20,
random_state=1)),
('gnb', GaussianNB())],
voting='soft', weights=[2, 1, 1])
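Since weights is itself a constructor parameter of VotingClassifier, the voting weights can be tuned by the same search; a minimal sketch reusing eclf from above (the candidate weight vectors are illustrative):

# search over the ensemble's own voting weights together with a nested parameter
params = {'weights': [[1, 1, 1], [2, 1, 1], [1, 2, 1]],  # illustrative candidates
          'lr__C': [1.0, 100.0]}
grid = GridSearchCV(estimator=eclf, param_grid=params, cv=5)
grid = grid.fit(X, y)
print(grid.best_params_)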