RandomForestClassifier

1. Parameters

n_estimators int, default=100

The number of trees in the forest.

Changed in version 0.22: The default value of n_estimators changed from 10 to 100 in 0.22.

If it is too small the forest tends to underfit; if it is too large the computation becomes expensive and, beyond a certain number of trees, the improvement is marginal, so choose a moderate value.

How should it be chosen relative to the number of samples and features?
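
A minimal sketch of how the score levels off as n_estimators grows. The dataset comes from make_classification and all sizes and values are illustrative assumptions, not taken from the original post:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
for n in (10, 50, 100, 300):
    clf = RandomForestClassifier(n_estimators=n, random_state=0, n_jobs=-1)
    print("n_estimators=%d, CV accuracy=%.4f" % (n, cross_val_score(clf, X, y, cv=5).mean()))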

criterion {“gini”, “entropy”}, default=”gini”

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.

The criterion used to judge how good a feature is when selecting splits; the default is usually fine.

max_depth int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

If the model has many samples and many features, limiting the maximum depth is recommended, typically to a value between 10 and 100; with little data and few features the default can be left unchanged.

min_samples_split int or float, default=2       (condition limiting whether a subtree is split further)

The minimum number of samples required to split an internal node:

  • If int, then consider min_samples_split as the minimum number.

  • If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.

Changed in version 0.18: Added float values for fractions.

If the number of samples is of a very large order of magnitude, increasing this value is recommended; with little data and few features the default can be left unchanged.

If a node contains fewer than min_samples_split samples, no further attempt is made to pick the best feature to split it on.

min_samples_leaf int or float, default=1  (minimum number of samples required at a leaf node)

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

  • If int, then consider min_samples_leaf as the minimum number.

  • If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.

If the number of samples is of a very large order of magnitude, increasing this value is recommended; with little data and few features the default can be left unchanged.

If a leaf node ends up with fewer samples than this value, it is pruned together with its sibling node.
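
A rough sketch combining the three pre-pruning parameters discussed above (max_depth, min_samples_split, min_samples_leaf). The data is synthetic and the values are arbitrary examples, not recommendations for any particular dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,          # cap tree depth on larger datasets
    min_samples_split=20,  # a node needs at least 20 samples before it may be split
    min_samples_leaf=10,   # every leaf keeps at least 10 samples
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))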

min_weight_fraction_leaf float, default=0.0   (minimum total sample weight required at a leaf node)

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

If a fairly large share of the samples have missing values, or the class distribution of the classification samples is heavily skewed, sample weights are introduced, and then this value needs attention.

If a leaf's weight fraction falls below this value, it is pruned together with its sibling node. The default of 0 means sample weights are not taken into account.

max_features {“auto”, “sqrt”, “log2”}, int or float, default=”auto”

The number of features to consider when looking for the best split:

  • If int, then consider max_features features at each split.

  • If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.

  • If “auto”, then max_features=sqrt(n_features).

  • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

  • If “log2”, then max_features=log2(n_features).

  • If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
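
A small sketch comparing max_features settings on synthetic data (the dataset and values are illustrative; "sqrt" is the usual starting point for classification, while None considers all features and makes the trees more correlated):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=40, n_informative=10, random_state=0)
for mf in ("sqrt", "log2", None):
    clf = RandomForestClassifier(max_features=mf, random_state=0, n_jobs=-1)
    print(mf, cross_val_score(clf, X, y, cv=5).mean())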

max_leaf_nodes int, default=None    (limits the maximum number of leaf nodes)

Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

If there are not many features, this value can be ignored; but if the trees end up with many branches, it can be restricted, with the concrete value found through cross-validation.

min_impurity_decrease float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

New in version 0.19.
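
A worked instance of the weighted impurity decrease formula above, using made-up node statistics rather than numbers from a fitted model:

N = 1000                  # total number of (weighted) samples
N_t = 200                 # samples at the current node
N_t_L, N_t_R = 120, 80    # samples in the left / right child
impurity = 0.48           # impurity of the current node
left_impurity, right_impurity = 0.30, 0.20

decrease = N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)
print(decrease)  # ≈ 0.044; the node is split only if this >= min_impurity_decrease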

min_impurity_split float, default=None  (limits the growth of the decision tree)

Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

Deprecated since version 0.19: min_impurity_split has been deprecated in favor of min_impurity_decrease in 0.19. The default value of min_impurity_split has changed from 1e-7 to 0 in 0.23 and it will be removed in 1.0 (renaming of 0.25). Use min_impurity_decrease instead.

If a node's impurity (Gini impurity or mean squared error) is below this threshold, the node is not split further and becomes a leaf. Changing the old default of 1e-7 is generally not recommended (and the parameter itself is deprecated, as noted above).

bootstrap bool, default=True

Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

oob_score bool, default=False

Whether to use out-of-bag samples to estimate the generalization accuracy.

Whether to use out-of-bag samples to assess how good the model is. The default is False. I recommend setting it to True, because the out-of-bag score reflects the generalization ability of the model after fitting.

n_jobs int, default=None

The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

Set this according to the available computing resources.

random_state int, RandomState instance or None, default=None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.

The random seed controls the random behaviour of any class or function that involves randomness. Fixing random_state to a particular value also fixes one particular random pattern.

As long as its value stays the same, building the forest on the same training set gives exactly the same result, and the predictions on the test set are identical too;

when its value changes, the resulting forest is different;

if the parameter is not set at all, a random mode is chosen automatically and every run produces a different result.

Summary: assign a value wherever random_state needs to be set, so that running the code several times gives exactly the same result and others can reproduce your process. If the parameter is left unset, a seed is chosen at random and the results will differ from run to run. random_state can in principle be tuned, but a model that performs well on the training set after such tuning will not necessarily perform well on unseen data, so in practice an arbitrary random_state value is simply chosen and fixed.
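
A quick sketch of the reproducibility point above, on synthetic data (the dataset and the seed 42 are arbitrary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
p1 = RandomForestClassifier(random_state=42).fit(X, y).predict(X)
p2 = RandomForestClassifier(random_state=42).fit(X, y).predict(X)
print(np.array_equal(p1, p2))  # True: same seed, same forest, same predictions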

verbose int, default=0

Controls the verbosity when fitting and predicting.

warm_start bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. See the Glossary.
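
A minimal sketch of warm_start on synthetic data: the second fit keeps the 50 trees already grown and only adds 50 more (all sizes are arbitrary):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
clf.fit(X, y)                  # grows the first 50 trees
clf.set_params(n_estimators=100)
clf.fit(X, y)                  # adds 50 more trees on top of the existing ones
print(len(clf.estimators_))    # 100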

class_weight {“balanced”, “balanced_subsample”}, dict or list of dicts, default=None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

"Another approach to making random forests more suitable for learning from extremely imbalanced data follows the idea of cost-sensitive learning. Since the random forest classifier tends to be biased towards the majority class, a heavier penalty is placed on misclassifying the minority class. We assign a weight to each class, with the minority class given a larger weight (i.e., a higher misclassification cost). Class weights are incorporated into the random forest algorithm in two places. During tree induction, class weights are used to weight the Gini criterion used for finding splits. In the terminal nodes of each tree, class weights are taken into account again. The class prediction of each terminal node is determined by a 'weighted majority vote'; i.e., the weighted vote for a class is the weight of that class times the number of cases of that class at the terminal node. The final class prediction of the random forest is then determined by aggregating the weighted votes of the individual trees, where the weights are the average weights in the terminal nodes. Class weights are an essential tuning parameter for achieving the desired performance. The out-of-bag estimate of accuracy from random forests can be used to select the weights. This method, weighted random forest (WRF), is incorporated in the present version of the software."
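
A sketch of the cost-sensitive idea from the quote above, using class_weight on a deliberately imbalanced synthetic dataset (the 90/10 imbalance and sizes are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced", "balanced_subsample"):
    clf = RandomForestClassifier(class_weight=cw, random_state=0).fit(X_tr, y_tr)
    print(cw, recall_score(y_te, clf.predict(X_te)))  # recall on the minority class
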
ccp_alpha non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.

New in version 0.22.

max_samples int or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

  • If None (default), then draw X.shape[0] samples.

  • If int, then draw max_samples samples.

  • If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1).

New in version 0.22.
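
A short sketch of bootstrap together with max_samples (the 0.8 fraction and the data are arbitrary examples): each tree is grown on a bootstrap sample of 80% of the rows.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)
clf = RandomForestClassifier(bootstrap=True, max_samples=0.8, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))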

2. Summary

If the sample size is not large and there are not many features, it is suggested to set n_estimators, oob_score, random_state, and bootstrap (default) + max_samples.

  • oob_score (out-of-bag error rate): an important advantage of random forests is that neither cross-validation nor a separate test set is needed to obtain an unbiased estimate of the error. The estimate can be produced internally, i.e. an unbiased error estimate is built up while the forest is being generated.

    When each tree is built, a different bootstrap sample of the training set is used (drawn at random with replacement). So for any given tree (say the k-th tree), roughly 1/3 of the training instances take no part in growing it; these are called the OOB samples of the k-th tree.

    This sampling scheme is what makes the OOB estimate possible. It is computed as follows (note: per sample), with a short sketch after the list:

1) For each sample, look at how it is classified by the trees for which it is an OOB sample (about 1/3 of the trees);
2) take a simple majority vote of those trees as the classification result for that sample;
3) finally, the fraction of misclassified samples over the total number of samples is the OOB misclassification rate of the random forest.
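
A sketch of the OOB estimate described in the three steps above, on synthetic data; oob_score=True asks sklearn to do this bookkeeping internally:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)
clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # OOB accuracy; 1 - clf.oob_score_ is the OOB error rate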

    random_state: with the same random_state, repeated training runs give the same result.

    bootstrap (default) + max_samples: why sample the training set randomly?

        If there were no random sampling and every tree used the same training set, the trained trees would all give exactly the same classification result, and bagging would be pointless.

        Why sample with replacement?

            If the sampling were done without replacement, each tree's training samples would be different and disjoint, so every tree would be "biased" and completely "one-sided" (loosely speaking), i.e. the trained trees would differ from one another enormously. The final classification of a random forest is decided by the vote of many trees (weak classifiers), and that vote should be about "seeking common ground", so training each tree on entirely different training sets would not help the final result; it would be like blind men describing an elephant.

If the sample size is large and there are many features, it is suggested to also set n_estimators, oob_score, random_state, max_depth, min_samples_split, min_samples_leaf, max_leaf_nodes and min_impurity_split.

 

3. Grid search for hyperparameters

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def random_forest_predict(train_x, train_y, test_x, test_y):
    # Grid over the number of trees, tree depth and bootstrap sample fraction.
    # Note: depending on the sklearn version, a float max_samples may have to
    # stay strictly below 1.0.
    param = {'n_estimators': range(10, 71, 10), 'max_depth': range(3, 12, 1),
             'max_samples': np.arange(0.5, 1.05, 0.1)}
    gsearch = GridSearchCV(estimator=RandomForestClassifier(max_features='sqrt',
                                                            oob_score=True,
                                                            random_state=10),
                           param_grid=param, scoring='roc_auc',
                           cv=5, n_jobs=-1)  # iid was deprecated and later removed
    gsearch.fit(train_x, train_y)
    print(gsearch.cv_results_, gsearch.best_params_, gsearch.best_score_)
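
A hypothetical way to call the function above, using synthetic data in place of the post's real network-flow feature table:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=14, random_state=10)
train_x, test_x, train_y, test_y = train_test_split(X, y, stratify=y, random_state=10)
random_forest_predict(train_x, train_y, test_x, test_y)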

  

The final model:

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

def random_forest_predict1(train_x, train_y, test_x, test_y):
    # Parameters taken from the grid-search result above.
    clf = RandomForestClassifier(n_estimators=30, max_depth=7, max_samples=0.9,
                                 oob_score=True, random_state=30)
    clf.fit(train_x, train_y)
    print(clf.oob_score_)  # out-of-bag accuracy

    # Probability of the positive class, used for the AUC score.
    pre_y = clf.predict_proba(test_x)[:, 1]
    print("AUC Score: %f" % metrics.roc_auc_score(test_y, pre_y))

    # Hard class predictions, used for precision and recall.
    pre_y = clf.predict(test_x)
    # print("Accuracy: %f" % metrics.accuracy_score(test_y, pre_y))
    print("Precision score: %f" % metrics.precision_score(test_y, pre_y))
    print("Recall score: %f" % metrics.recall_score(test_y, pre_y))

 

4. Model visualization

import pydotplus
from sklearn import tree

def vision_model(clf):
    # Feature names in the same order as the columns of the training data.
    feature_names = ["duration", "numPktsIn", "numPktsOut", "bytesIn", "bytesOut",
                     "avgIpt", "avgIptIn", "avgIptOut", "encryptPkts",
                     "encryptPktsClass", "encryptPkts", "encrypt1Pkts",
                     "encrypt2Pkts", "encrypt3Pkts"]
    # Export every tree in the forest to a PDF via Graphviz.
    for idx, model in enumerate(clf.estimators_):
        dot_name = "bruteforce_%d.dot" % idx
        png_name = "bruteforce_%d.png" % idx
        dot_data = tree.export_graphviz(model, out_file=None,
                                        feature_names=feature_names)
        # call(['dot', '-Tpng', os.path.join(os.getcwd(), dot_name), '-o', os.path.join(os.getcwd(), png_name), '-Gdpi=600'])
        graph = pydotplus.graph_from_dot_data(dot_data)
        graph.write_pdf("bruteforce_%d.pdf" % idx)
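
A hypothetical call of vision_model, training a tiny forest on a synthetic 14-column matrix that stands in for the real flow features:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=14, random_state=0)
clf = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)
vision_model(clf)  # writes bruteforce_0.pdf, bruteforce_1.pdf, bruteforce_2.pdf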

When running this you may hit a "GraphViz's executables not found" error. pip install graphviz did not fix it; the solution is to download the installer from https://graphviz.gitlab.io/_pages/Download/Download_windows.html, tick the option to add it to the system PATH during installation (or add it to PATH yourself), and then restart the IDE/console.
