[Python] Random Forest Algorithm: Northeastern University Big Data Class, Data Mining Lab 4


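Using the data in train.csv, build a classification model with the random forest algorithm in the H2O framework, then use the model to predict the data in test.csv and compute the classification accuracy to evaluate the model. Tune the parameters and observe how the accuracy changes. Note: accuracy = number of correct predictions / total number of samples. (Some feature selection may be done to improve the accuracy.)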

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.

H2O cluster uptime:	1 min 19 secs
H2O cluster timezone:	Asia/Shanghai
H2O data parsing timezone:	UTC
H2O cluster version:	3.28.0.1
H2O cluster version age:	16 days
H2O cluster name:	H2O_from_python_寮犲織娴4kdmlj
H2O cluster total nodes:	1
H2O cluster free memory:	3.512 Gb
H2O cluster total cores:	4
H2O cluster allowed cores:	4
H2O cluster status:	locked, healthy
H2O connection url:	http://localhost:54321
H2O connection proxy:	{'http': None, 'https': None}
H2O internal security:	False
H2O API Extensions:	Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version:	3.7.4 final

train=h2o.import_file(path ="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\train.csv")
test=h2o.import_file(path = "C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\test.csv")

Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%

train.head(5)

driver	trip	Average_speed	Average_ABS_Acceleration	Average_RPM	Variance_speed	Variance_ABS_Acceleration	Variance_RPM	v_a	v_b	v_c	v_d	a_a	a_b	a_c	r_a	r_b	r_c	Catrgory
4.10304e+10	1	6	0.218219	1209.08	33.4659	0.154504	242766	0.564121	0.224947	0.16328	0.047652	0.594954	0.288718	0.116328	0.585144	0.348283	0.066573	cluster2
4.10304e+10	2	3	0.305416	1064.18	24.5744	0.283866	185456	0.575369	0.291626	0.133005	0	0.57734	0.210837	0.211823	0.57734	0.365517	0.057143	cluster2
4.10304e+10	3	5	0.121377	1168.5	24.3105	0.012078	224469	0.574566	0.269364	0.156069	0	0.531792	0.393064	0.075145	0.56763	0.354913	0.077457	cluster2
4.10304e+10	4	7	0.185244	1175.39	41.511	0.323999	260512	0.498039	0.196078	0.214994	0.090888	0.685582	0.236217	0.078201	0.432757	0.505882	0.061361	cluster2
4.10304e+10	5	9	0.255851	1311.18	53.3696	0.440556	309292	0.39738	0.131823	0.318504	0.152293	0.543395	0.299945	0.156659	0.32369	0.60726	0.06905	cluster1

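train.csv is the training dataset. It contains the processed results of clustering for driver behavior recognition. The two columns driver and trip are not used when building the model.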

train=train[2:]  # drop the two unused columns, driver and trip
test=test[2:]  # drop the two unused columns, driver and trip

train.head(5)

Average_speed	Average_ABS_Acceleration	Average_RPM	Variance_speed	Variance_ABS_Acceleration	Variance_RPM	v_a	v_b	v_c	v_d	a_a	a_b	a_c	r_a	r_b	r_c	Catrgory
6	0.218219	1209.08	33.4659	0.154504	242766	0.564121	0.224947	0.16328	0.047652	0.594954	0.288718	0.116328	0.585144	0.348283	0.066573	cluster2
3	0.305416	1064.18	24.5744	0.283866	185456	0.575369	0.291626	0.133005	0	0.57734	0.210837	0.211823	0.57734	0.365517	0.057143	cluster2
5	0.121377	1168.5	24.3105	0.012078	224469	0.574566	0.269364	0.156069	0	0.531792	0.393064	0.075145	0.56763	0.354913	0.077457	cluster2
7	0.185244	1175.39	41.511	0.323999	260512	0.498039	0.196078	0.214994	0.090888	0.685582	0.236217	0.078201	0.432757	0.505882	0.061361	cluster2
9	0.255851	1311.18	53.3696	0.440556	309292	0.39738	0.131823	0.318504	0.152293	0.543395	0.299945	0.156659	0.32369	0.60726	0.06905	cluster1

1. Build the model directly, with all parameters at their defaults

Accuracy: 0.8666666666666667

model1 = H2ORandomForestEstimator()  # initialize the model with default parameters
model1.train(x=train.names[0:-1], y='Catrgory', training_frame=train)  # train the model; train.names[0:-1] excludes the last column (the label)

drf Model Build progress: |███████████████████████████████████████████████| 100%

predict = model1.predict(test[test.names[0:-1]])  # predict on the test set; test[test.names[0:-1]] drops the last column (the label)
predict.head(5)

drf prediction progress: |████████████████████████████████████████████████| 100%

predict	cluster0	cluster1	cluster2
cluster2	0.0204082	0	0.979592
cluster2	0.12963	0	0.87037
cluster2	0	0	1
cluster2	0	0	1
cluster1	0	1	0

Note: accuracy = number of correct predictions / total number of samples

tmp = predict[predict['predict'] == test['Catrgory']].nrow 
accuracy = tmp/test.nrow
accuracy

0.8666666666666667
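
The accuracy definition above can also be checked outside H2O. The following is a minimal sketch in plain Python, assuming the predicted and true labels have already been copied into ordinary lists. The function name `label_accuracy` and the sample labels are illustrative and do not come from the original notebook.

```python
def label_accuracy(predicted, actual):
    """Return the fraction of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels, for illustration only.
predicted = ["cluster2", "cluster2", "cluster1", "cluster0"]
actual = ["cluster2", "cluster1", "cluster1", "cluster0"]
print(label_accuracy(predicted, actual))  # 3 of 4 labels match
```

The H2O expression above does the same thing on frames: it keeps the rows where the predicted label equals the true label and divides their count by the number of test rows.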

Inspect the model details to find the features that have the greatest influence on the predictions

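model1.deepfeatures  # attribute access without a call: displaying the bound method prints the model summary shown below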

Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  DRF_model_python_1577882615850_1


Model Summary:

		number_of_trees	number_of_internal_trees	model_size_in_bytes	min_depth	max_depth	mean_depth	min_leaves	max_leaves	mean_leaves
0		50.0	150.0	59341.0	5.0	13.0	8.14	14.0	52.0	26.773333

ModelMetricsMultinomial: drf
** Reported on train data. **

MSE: 0.048564890251647425
RMSE: 0.22037443193720868
LogLoss: 0.16320718635092735
Mean Per-Class Error: 0.07050700819826967

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class

	cluster0	cluster1	cluster2	Error	Rate
0	138.0	1.0	14.0	0.098039	15 / 153
1	1.0	161.0	11.0	0.069364	12 / 173
2	6.0	6.0	260.0	0.044118	12 / 272
3	145.0	168.0	285.0	0.065217	39 / 598

Top-3 Hit Ratios:

	k	hit_ratio
0	1	0.934783
1	2	1.000000
2	3	1.000000

Scoring History:

	timestamp	duration	number_of_trees	training_rmse	training_logloss	training_classification_error
0	2020-01-01 20:45:33	0.049 sec	0.0	NaN	NaN	NaN
1	2020-01-01 20:45:34	0.383 sec	1.0	0.359650	3.811475	0.117391
2	2020-01-01 20:45:34	0.483 sec	2.0	0.342797	3.340081	0.105691
3	2020-01-01 20:45:34	0.515 sec	3.0	0.330296	3.012446	0.089862
4	2020-01-01 20:45:34	0.562 sec	4.0	0.320177	2.679887	0.089613
5	2020-01-01 20:45:34	0.587 sec	5.0	0.298609	2.080400	0.087361
6	2020-01-01 20:45:34	0.622 sec	6.0	0.281188	1.640286	0.083929
7	2020-01-01 20:45:34	0.653 sec	7.0	0.278461	1.430675	0.086655
8	2020-01-01 20:45:34	0.682 sec	8.0	0.269822	1.243377	0.090909
9	2020-01-01 20:45:34	0.703 sec	9.0	0.263806	1.178969	0.087179
10	2020-01-01 20:45:34	0.731 sec	10.0	0.250604	0.825163	0.078992
11	2020-01-01 20:45:34	0.753 sec	11.0	0.242310	0.759343	0.068562
12	2020-01-01 20:45:34	0.783 sec	12.0	0.239949	0.702918	0.070234
13	2020-01-01 20:45:34	0.803 sec	13.0	0.233250	0.482001	0.070234
14	2020-01-01 20:45:34	0.833 sec	14.0	0.229632	0.426821	0.061873
15	2020-01-01 20:45:34	0.863 sec	15.0	0.231505	0.429770	0.063545
16	2020-01-01 20:45:34	0.890 sec	16.0	0.229281	0.375294	0.066890
17	2020-01-01 20:45:34	0.919 sec	17.0	0.229443	0.375982	0.068562
18	2020-01-01 20:45:34	0.949 sec	18.0	0.229665	0.377334	0.068562
19	2020-01-01 20:45:34	0.974 sec	19.0	0.230373	0.379523	0.070234

See the whole table with table.as_data_frame()

Variable Importances:

	variable	relative_importance	scaled_importance	percentage
0	Average_speed	3703.256836	1.000000	0.245570
1	r_a	2256.470947	0.609321	0.149631
2	v_a	1821.382812	0.491833	0.120779
3	v_d	1685.737915	0.455204	0.111785
4	r_b	1604.149536	0.433173	0.106374
5	Average_RPM	1018.616333	0.275060	0.067546
6	v_c	668.664001	0.180561	0.044340
7	Variance_speed	553.771790	0.149536	0.036722
8	a_a	523.651306	0.141403	0.034724
9	v_b	439.868347	0.118779	0.029169
10	a_b	200.154129	0.054048	0.013273
11	r_c	155.026993	0.041862	0.010280
12	Variance_RPM	142.054703	0.038359	0.009420
13	a_c	121.158333	0.032717	0.008034
14	Average_ABS_Acceleration	113.996506	0.030783	0.007559
15	Variance_ABS_Acceleration	72.286301	0.019520	0.004793

<bound method ModelBase.deepfeatures of >
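
The Error and Rate columns of the confusion matrix above follow directly from the counts. The sketch below copies the counts from that table and recomputes the per-class error, the overall error, and the mean per-class error. The variable names are illustrative.

```python
labels = ["cluster0", "cluster1", "cluster2"]

# Rows are actual classes, columns are predicted classes
# (counts copied from the confusion matrix above).
confusion = [
    [138, 1, 14],
    [1, 161, 11],
    [6, 6, 260],
]

# Per-class error: misclassified rows of a class divided by all rows of that class.
per_class_error = {}
for i, row in enumerate(confusion):
    wrong = sum(row) - row[i]
    per_class_error[labels[i]] = wrong / sum(row)

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
overall_error = (total - correct) / total
mean_per_class_error = sum(per_class_error.values()) / len(labels)

print(per_class_error)
print(overall_error, mean_per_class_error)
```

The overall error is 39 / 598, which matches the last row of the table and corresponds to the top-1 hit ratio of 0.934783. These figures are reported on the training data, so they are higher than the 0.8667 measured on test.csv.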

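2. Build the model after feature selection, with all parameters at their defaults

Select the eight most influential features and keep only those columns, plus the label column Catrgory. These are the same eight features that rank highest in the variable-importance table above, although the list below is not in that table's order:

[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]

Accuracy: 0.8666666666666667 (unchanged)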
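
The feature selection can be reproduced from the Variable Importances table printed earlier. The sketch below hard-codes the `percentage` column from that table, sorts it, and keeps the top eight names. The cutoff `TOP_K = 8` mirrors the choice made in this post, and the variable names are illustrative.

```python
# (variable, percentage) pairs copied from the Variable Importances table.
importances = [
    ("Average_speed", 0.245570),
    ("r_a", 0.149631),
    ("v_a", 0.120779),
    ("v_d", 0.111785),
    ("r_b", 0.106374),
    ("Average_RPM", 0.067546),
    ("v_c", 0.044340),
    ("Variance_speed", 0.036722),
    ("a_a", 0.034724),
    ("v_b", 0.029169),
    ("a_b", 0.013273),
    ("r_c", 0.010280),
    ("Variance_RPM", 0.009420),
    ("a_c", 0.008034),
    ("Average_ABS_Acceleration", 0.007559),
    ("Variance_ABS_Acceleration", 0.004793),
]

TOP_K = 8  # number of features to keep, as chosen in this post

# Sort by importance share, largest first, and keep the first TOP_K names.
ranked = sorted(importances, key=lambda item: item[1], reverse=True)
top_features = [name for name, _ in ranked[:TOP_K]]
coverage = sum(share for _, share in ranked[:TOP_K])

print(top_features)
print(round(coverage, 6))
```

The eight selected features account for roughly 88% of the total importance, which may help explain why dropping the other eight leaves the accuracy unchanged.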

train_features= train[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]
test_features= test[['Average_speed','r_a', 'r_b','Average_RPM','v_a','v_d','Variance_speed','v_c','Catrgory']]

### Build the model after feature selection, default parameters
### Accuracy: 0.8666666666666667

model2 = H2ORandomForestEstimator()
model2.train(x = train_features.names[0:-1],y = 'Catrgory',training_frame = train_features)

drf Model Build progress: |███████████████████████████████████████████████| 100%

predict = model2.predict(test_features[test_features.names[0:-1]])  # predict on the test set without the label column

drf prediction progress: |████████████████████████████████████████████████| 100%

tmp = predict[predict['predict'] == test_features['Catrgory']].nrow 
accuracy = tmp/test_features.nrow
accuracy

0.8666666666666667

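3. Tune the parameters and observe how the classification accuracy changes.

3.1 Tune the parameters (ntrees and max_depth) with for loops to find the maximum accuracy and the best parameters

Maximum accuracy: 0.894

ntrees: 5

max_depth : 9

The output of this part is too large to show here. The best parameters (ntrees and max_depth) are obtained from it.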

max_accuracy = 0
ntrees = 0
max_depth = 0
# Try every (ntrees, max_depth) pair from 1 to 19 and keep the most accurate one
for i in range(1, 20):
    for j in range(1, 20):
        model3 = H2ORandomForestEstimator(ntrees=i, max_depth=j)
        model3.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
        predict = model3.predict(test[test.names[0:-1]])
        tmp = predict[predict['predict'] == test['Catrgory']].nrow
        accuracy = tmp / test.nrow
        print("now acc is:", accuracy, "--- max acc is :", max_accuracy)
        if max_accuracy < accuracy:
            max_accuracy = accuracy
            ntrees = i
            max_depth = j

print("max acc:", max_accuracy)
print("best ntrees:", ntrees)
print("best max_depth:", max_depth)
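
The nested loops above are an exhaustive scan over two parameters. The same pattern can be written once and reused. The sketch below is independent of H2O and assumes a caller-supplied scoring function. Here `toy_score` is a made-up stand-in for "train a model and return its accuracy", and `scan_parameters` is a hypothetical helper name.

```python
from itertools import product


def scan_parameters(grid, score_fn):
    """Try every combination in grid and return (best_params, best_score).

    grid maps parameter names to lists of candidate values.
    score_fn takes keyword arguments and returns a number (higher is better).
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # strict ">" keeps the first of any tied combinations
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(ntrees, max_depth):
    # Hypothetical stand-in for "train and evaluate"; peaks at ntrees=5, max_depth=9.
    return 1.0 - 0.01 * abs(ntrees - 5) - 0.01 * abs(max_depth - 9)


grid = {"ntrees": list(range(1, 20)), "max_depth": list(range(1, 20))}
best_params, best_score = scan_parameters(grid, toy_score)
print(best_params, best_score)
```

Two caveats apply to the real scan. A random forest is randomized, so the best pair can differ from run to run unless the estimator's `seed` parameter is fixed. The scan also picks parameters by their accuracy on test.csv, so the maximum it reports is likely an optimistic estimate; holding out part of train.csv for the selection would give a cleaner number.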

model3 = H2ORandomForestEstimator(ntrees=ntrees, max_depth=max_depth)  # retrain with the best parameters found by the loop above
model3.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)

drf Model Build progress: |███████████████████████████████████████████████| 100%

predict = model3.predict(test[test.names[0:-1]])

drf prediction progress: |████████████████████████████████████████████████| 100%

tmp = predict[predict['predict'] == test['Catrgory']].nrow 
accuracy = tmp/test.nrow
accuracy

Merge the test data with the predicted labels and save the resulting dataset as predict.csv

out = test.concat(predict['predict'])
h2o.download_csv(out,"predict.csv")

'C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp4\\predict.csv'

3.2 Find the best parameters with Grid Search

Accuracy: 0.8708333333333333

ntrees: 10

max_depth : 10

rf_params = {'ntrees': list(range(30, 60)),
             'max_depth': list(range(10, 20))}
 
rf_grid = H2OGridSearch(model = H2ORandomForestEstimator,
                        hyper_params=rf_params)

rf_grid.train(x = train.names[0:-1],
               y = 'Catrgory',
               training_frame = train)

The output of this part is too large to show here. The best parameters (ntrees and max_depth) are obtained from it.

rf_grid.show()
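
The size of the search output follows from the size of the grid, because every combination of ntrees and max_depth is trained as a separate model. The sketch below counts the combinations for the two searches in this post, using the same ranges as the code. The helper name `grid_size` is illustrative.

```python
# Parameter ranges copied from the for-loop scan (3.1) and the grid search (3.2).
loop_grid = {"ntrees": range(1, 20), "max_depth": range(1, 20)}
h2o_grid = {"ntrees": range(30, 60), "max_depth": range(10, 20)}


def grid_size(grid):
    """Return the number of parameter combinations in a full grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


print(grid_size(loop_grid))  # 19 * 19 combinations
print(grid_size(h2o_grid))   # 30 * 10 combinations
```

With 361 and 300 models respectively, both searches print far more than fits in a post, which is why only the selected parameters are reported.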

model4 = H2ORandomForestEstimator(ntrees=10, max_depth=10)  # parameters reported as best for this section
model4.train(x = train.names[0:-1],y = 'Catrgory',training_frame = train)

predict = model4.predict(test[test.names[0:-1]])

tmp = predict[predict['predict'] == test['Catrgory']].nrow 
accuracy = tmp/test.nrow
accuracy

Posted 2020-01-01 21:17 by 爱做梦的子浩
