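Using the data in train.csv, build a classification model with the random forest algorithm in the H2O framework, use the model to predict the data in test.csv, and compute the classification accuracy to evaluate the model. Then tune the parameters and observe how the accuracy changes. Note: accuracy = number of correct predictions / total number of samples. (Some feature selection may be applied to improve the accuracy.)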
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O cluster uptime:           1 min 19 secs
H2O cluster timezone:         Asia/Shanghai
H2O data parsing timezone:    UTC
H2O cluster version:          3.28.0.1
H2O cluster version age:      16 days
H2O cluster name:             H2O_from_python_寮犲織娴4kdmlj
H2O cluster total nodes:      1
H2O cluster free memory:      3.512 Gb
H2O cluster total cores:      4
H2O cluster allowed cores:    4
H2O cluster status:           locked, healthy
H2O connection url:           http://localhost:54321
H2O connection proxy:         {'http': None, 'https': None}
H2O internal security:        False
H2O API Extensions:           Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version:               3.7.4 final
train = h2o.import_file(path="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\train.csv")
test = h2o.import_file(path="C:\\Users\\zzh\\Desktop\\dataMiningExperment\\data4\\test.csv")
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
train.head(5)
driver trip Average_speed Average_ABS_Acceleration Average_RPM Variance_speed Variance_ABS_Acceleration Variance_RPM v_a v_b v_c v_d a_a a_b a_c r_a r_b r_c Catrgory
4.10304e+10 1 6 0.218219 1209.08 33.4659 0.154504 242766 0.564121 0.224947 0.16328 0.047652 0.594954 0.288718 0.116328 0.585144 0.348283 0.066573 cluster2
4.10304e+10 2 3 0.305416 1064.18 24.5744 0.283866 185456 0.575369 0.291626 0.133005 0 0.57734 0.210837 0.211823 0.57734 0.365517 0.057143 cluster2
4.10304e+10 3 5 0.121377 1168.5 24.3105 0.012078 224469 0.574566 0.269364 0.156069 0 0.531792 0.393064 0.075145 0.56763 0.354913 0.077457 cluster2
4.10304e+10 4 7 0.185244 1175.39 41.511 0.323999 260512 0.498039 0.196078 0.214994 0.090888 0.685582 0.236217 0.078201 0.432757 0.505882 0.061361 cluster2
4.10304e+10 5 9 0.255851 1311.18 53.3696 0.440556 309292 0.39738 0.131823 0.318504 0.152293 0.543395 0.299945 0.156659 0.32369 0.60726 0.06905 cluster1
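train.csv is the training dataset. It contains the processed clustering results of a driver behavior recognition task. The first two columns, driver and trip, are not used when building the model, so they are dropped.
# A plain slice on an H2OFrame selects columns: keep everything from column index 2 onward
train = train[2:]
test = test[2:]
train.head(5)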
Average_speed Average_ABS_Acceleration Average_RPM Variance_speed Variance_ABS_Acceleration Variance_RPM v_a v_b v_c v_d a_a a_b a_c r_a r_b r_c Catrgory
6 0.218219 1209.08 33.4659 0.154504 242766 0.564121 0.224947 0.16328 0.047652 0.594954 0.288718 0.116328 0.585144 0.348283 0.066573 cluster2
3 0.305416 1064.18 24.5744 0.283866 185456 0.575369 0.291626 0.133005 0 0.57734 0.210837 0.211823 0.57734 0.365517 0.057143 cluster2
5 0.121377 1168.5 24.3105 0.012078 224469 0.574566 0.269364 0.156069 0 0.531792 0.393064 0.075145 0.56763 0.354913 0.077457 cluster2
7 0.185244 1175.39 41.511 0.323999 260512 0.498039 0.196078 0.214994 0.090888 0.685582 0.236217 0.078201 0.432757 0.505882 0.061361 cluster2
9 0.255851 1311.18 53.3696 0.440556 309292 0.39738 0.131823 0.318504 0.152293 0.543395 0.299945 0.156659 0.32369 0.60726 0.06905 cluster1
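1. Build the model directly, all parameters left at their defaults
Accuracy: 0.8666666666666667
model1 = H2ORandomForestEstimator()
# All columns except the last are features; the last column, 'Catrgory', is the label
model1.train(x=train.names[0:-1], y='Catrgory', training_frame=train)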
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model1.predict(test[test.names[0:-1]])
predict.head(5)
drf prediction progress: |████████████████████████████████████████████████| 100%
predict cluster0 cluster1 cluster2
cluster2 0.0204082 0 0.979592
cluster2 0.12963 0 0.87037
cluster2 0 0 1
cluster2 0 0 1
cluster1 0 1 0
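In this output the `predict` column is the class with the highest probability in each row. A minimal sketch of that rule in plain Python, using the first two rows of the table above; the helper name `predicted_label` is made up for illustration and is not part of the H2O API:

```python
def predicted_label(class_probs):
    """Return the class name whose probability is largest."""
    return max(class_probs, key=class_probs.get)


# First two rows of the prediction table shown above.
rows = [
    {"cluster0": 0.0204082, "cluster1": 0.0, "cluster2": 0.979592},
    {"cluster0": 0.12963, "cluster1": 0.0, "cluster2": 0.87037},
]

labels = [predicted_label(row) for row in rows]
print(labels)
```

Both rows resolve to cluster2, which agrees with the `predict` column.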
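Note: accuracy = number of correct predictions / total number of samples
# Keep only the rows where the predicted label equals the true label, then count them
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp / test.nrow
accuracy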
0.8666666666666667
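The same ratio can be written without H2O, which makes the definition easy to check by hand. This is only a sketch: the function name `accuracy` and the four example labels are illustrative assumptions, not values from test.csv.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels, for illustration only: 3 of 4 predictions are correct.
example_pred = ["cluster2", "cluster2", "cluster1", "cluster0"]
example_true = ["cluster2", "cluster1", "cluster1", "cluster0"]
example_acc = accuracy(example_pred, example_true)
print(example_acc)
```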
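Inspect the model details to find the features that have the largest influence on the predictions
# deepfeatures is a method meant for deep learning models; show() prints the model details, including the variable importances
model1.show()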
Model Details
=============
H2ORandomForestEstimator : Distributed Random Forest
Model Key: DRF_model_python_1577882615850_1
Model Summary:
   number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
0  50.0             150.0                     59341.0              5.0        13.0       8.14        14.0        52.0        26.773333
ModelMetricsMultinomial: drf
** Reported on train data. **
MSE: 0.048564890251647425
RMSE: 0.22037443193720868
LogLoss: 0.16320718635092735
Mean Per-Class Error: 0.07050700819826967
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
   cluster0  cluster1  cluster2  Error     Rate
0  138.0     1.0       14.0      0.098039  15 / 153
1  1.0       161.0     11.0      0.069364  12 / 173
2  6.0       6.0       260.0     0.044118  12 / 272
3  145.0     168.0     285.0     0.065217  39 / 598
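The Error column and the Mean Per-Class Error reported earlier can be recomputed from the counts in the confusion matrix. The sketch below copies the three class rows from the matrix; the function name `per_class_error` is an illustrative choice, not an H2O call.

```python
# Counts copied from the confusion matrix: outer key = actual class,
# inner key = predicted class.
confusion = {
    "cluster0": {"cluster0": 138, "cluster1": 1, "cluster2": 14},
    "cluster1": {"cluster0": 1, "cluster1": 161, "cluster2": 11},
    "cluster2": {"cluster0": 6, "cluster1": 6, "cluster2": 260},
}


def per_class_error(matrix):
    """Misclassified share of each actual class."""
    errors = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        errors[actual] = (total - row[actual]) / total
    return errors


errors = per_class_error(confusion)
mean_per_class_error = sum(errors.values()) / len(errors)

# Overall error: all off-diagonal counts divided by all samples.
total_samples = sum(sum(row.values()) for row in confusion.values())
wrong = sum(n for actual, row in confusion.items()
            for predicted, n in row.items() if predicted != actual)
overall_error = wrong / total_samples

print(errors, mean_per_class_error, overall_error)
```

The mean of the three per-class errors corresponds to the Mean Per-Class Error line, and the overall error corresponds to the last row of the matrix (39 / 598).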
Top-3 Hit Ratios:
   k  hit_ratio
0  1  0.934783
1  2  1.000000
2  3  1.000000
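A top-k hit ratio is the share of samples whose true class is among the k classes with the highest predicted probability. A small sketch of that computation follows; the function name `hit_ratio` and the four probability rows are hypothetical illustration data, not taken from the model.

```python
def hit_ratio(prob_rows, actual, k):
    """Share of samples whose true class is among the k most probable classes."""
    hits = 0
    for probs, truth in zip(prob_rows, actual):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(actual)


# Hypothetical class probabilities and true labels, for illustration only.
prob_rows = [
    {"cluster0": 0.7, "cluster1": 0.2, "cluster2": 0.1},
    {"cluster0": 0.4, "cluster1": 0.5, "cluster2": 0.1},
    {"cluster0": 0.1, "cluster1": 0.3, "cluster2": 0.6},
    {"cluster0": 0.2, "cluster1": 0.3, "cluster2": 0.5},
]
actual = ["cluster0", "cluster0", "cluster2", "cluster0"]

ratios = [hit_ratio(prob_rows, actual, k) for k in (1, 2, 3)]
print(ratios)
```

With three classes the ratio at k = 3 is always 1, which is why the last row of the table reads 1.000000.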
Scoring History:
    timestamp            duration   number_of_trees  training_rmse  training_logloss  training_classification_error
0   2020-01-01 20:45:33  0.049 sec  0.0              NaN            NaN               NaN
1   2020-01-01 20:45:34  0.383 sec  1.0              0.359650       3.811475          0.117391
2   2020-01-01 20:45:34  0.483 sec  2.0              0.342797       3.340081          0.105691
3   2020-01-01 20:45:34  0.515 sec  3.0              0.330296       3.012446          0.089862
4   2020-01-01 20:45:34  0.562 sec  4.0              0.320177       2.679887          0.089613
5   2020-01-01 20:45:34  0.587 sec  5.0              0.298609       2.080400          0.087361
6   2020-01-01 20:45:34  0.622 sec  6.0              0.281188       1.640286          0.083929
7   2020-01-01 20:45:34  0.653 sec  7.0              0.278461       1.430675          0.086655
8   2020-01-01 20:45:34  0.682 sec  8.0              0.269822       1.243377          0.090909
9   2020-01-01 20:45:34  0.703 sec  9.0              0.263806       1.178969          0.087179
10  2020-01-01 20:45:34  0.731 sec  10.0             0.250604       0.825163          0.078992
11  2020-01-01 20:45:34  0.753 sec  11.0             0.242310       0.759343          0.068562
12  2020-01-01 20:45:34  0.783 sec  12.0             0.239949       0.702918          0.070234
13  2020-01-01 20:45:34  0.803 sec  13.0             0.233250       0.482001          0.070234
14  2020-01-01 20:45:34  0.833 sec  14.0             0.229632       0.426821          0.061873
15  2020-01-01 20:45:34  0.863 sec  15.0             0.231505       0.429770          0.063545
16  2020-01-01 20:45:34  0.890 sec  16.0             0.229281       0.375294          0.066890
17  2020-01-01 20:45:34  0.919 sec  17.0             0.229443       0.375982          0.068562
18  2020-01-01 20:45:34  0.949 sec  18.0             0.229665       0.377334          0.068562
19  2020-01-01 20:45:34  0.974 sec  19.0             0.230373       0.379523          0.070234
See the whole table with table.as_data_frame()
Variable Importances:
    variable                   relative_importance  scaled_importance  percentage
0   Average_speed              3703.256836          1.000000           0.245570
1   r_a                        2256.470947          0.609321           0.149631
2   v_a                        1821.382812          0.491833           0.120779
3   v_d                        1685.737915          0.455204           0.111785
4   r_b                        1604.149536          0.433173           0.106374
5   Average_RPM                1018.616333          0.275060           0.067546
6   v_c                        668.664001           0.180561           0.044340
7   Variance_speed             553.771790           0.149536           0.036722
8   a_a                        523.651306           0.141403           0.034724
9   v_b                        439.868347           0.118779           0.029169
10  a_b                        200.154129           0.054048           0.013273
11  r_c                        155.026993           0.041862           0.010280
12  Variance_RPM               142.054703           0.038359           0.009420
13  a_c                        121.158333           0.032717           0.008034
14  Average_ABS_Acceleration   113.996506           0.030783           0.007559
15  Variance_ABS_Acceleration  72.286301            0.019520           0.004793
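2. Build the model after feature selection, all parameters left at their defaults
Keep the eight features with the largest influence according to the variable importances of model1, plus the label column:
['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']
Accuracy: 0.8666666666666667 (unchanged)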
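The ranking step behind this selection can be sketched in plain Python. The `relative_importance` values are copied from the Variable Importances output of model1; the helper name `top_features` is an illustrative assumption, not an H2O function.

```python
# relative_importance values copied from the model1 output.
relative_importance = {
    "Average_speed": 3703.256836,
    "r_a": 2256.470947,
    "v_a": 1821.382812,
    "v_d": 1685.737915,
    "r_b": 1604.149536,
    "Average_RPM": 1018.616333,
    "v_c": 668.664001,
    "Variance_speed": 553.771790,
    "a_a": 523.651306,
    "v_b": 439.868347,
    "a_b": 200.154129,
    "r_c": 155.026993,
    "Variance_RPM": 142.054703,
    "a_c": 121.158333,
    "Average_ABS_Acceleration": 113.996506,
    "Variance_ABS_Acceleration": 72.286301,
}


def top_features(importance, k):
    """Names of the k features with the largest importance, highest first."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


top8 = top_features(relative_importance, 8)

# Share of the total importance carried by each feature
# (this is what the "percentage" column reports).
total = sum(relative_importance.values())
share = {name: value / total for name, value in relative_importance.items()}

print(top8)
print(round(sum(share[name] for name in top8), 3))
```

The resulting set of eight names is the same set used for `train_features` and `test_features`, although the order within the list differs.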
train_features = train[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]
test_features = test[['Average_speed', 'r_a', 'r_b', 'Average_RPM', 'v_a', 'v_d', 'Variance_speed', 'v_c', 'Catrgory']]
model2 = H2ORandomForestEstimator()
model2.train(x=train_features.names[0:-1], y='Catrgory', training_frame=train_features)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model2.predict(test_features[test_features.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test_features['Catrgory']].nrow
accuracy = tmp / test_features.nrow
accuracy
0.8666666666666667
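3. Tune the parameters and observe how the classification accuracy changes
3.1 Tune the parameters (ntrees and max_depth) with nested for loops and keep the combination with the highest accuracy
Highest accuracy: 0.894
ntrees: 5
max_depth: 9
The output of this part is too long to show; the best parameters (ntrees and max_depth) are taken from it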
max_accuracy = 0
ntrees = 0
max_depth = 0
# Try every combination of ntrees and max_depth in 1..19 and remember the best one
for i in range(1, 20):
    for j in range(1, 20):
        model3 = H2ORandomForestEstimator(ntrees=i, max_depth=j)
        model3.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
        predict = model3.predict(test[test.names[0:-1]])
        tmp = predict[predict['predict'] == test['Catrgory']].nrow
        accuracy = tmp / test.nrow
        print("now acc is:", accuracy, "--- max acc is :", max_accuracy)
        if max_accuracy < accuracy:
            max_accuracy = accuracy
            ntrees = i
            max_depth = j
print("max acc:", max_accuracy)
print("best ntrees :", ntrees)
print("best max_depth :", max_depth)
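The nested loops are an exhaustive grid search: score every parameter combination and keep the best one. A minimal, library-free sketch of that pattern is shown here. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is an artificial stand-in for "train a model and measure its accuracy", constructed so that it peaks at ntrees=5, max_depth=9.

```python
from itertools import product


def grid_search(score, grid):
    """Evaluate score(**params) for every combination in grid; return the best."""
    best_score, best_params = float("-inf"), None
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:  # strict '>' keeps the first of any ties
            best_score, best_params = current, params
    return best_params, best_score


def toy_score(ntrees, max_depth):
    """Artificial score with a single peak at ntrees=5, max_depth=9 (assumption)."""
    return 1.0 - 0.01 * abs(ntrees - 5) - 0.01 * abs(max_depth - 9)


best_params, best_score = grid_search(
    toy_score, {"ntrees": range(1, 20), "max_depth": range(1, 20)}
)
print(best_params, best_score)
```

Passing the scoring step in as a function keeps the search loop independent of the model, so the same loop could wrap an H2O training call.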
# Refit with the best parameters found by the loop (ntrees=5, max_depth=9 in the run reported above)
model3 = H2ORandomForestEstimator(ntrees=ntrees, max_depth=max_depth)
model3.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
drf Model Build progress: |███████████████████████████████████████████████| 100%
predict = model3.predict(test[test.names[0:-1]])
drf prediction progress: |████████████████████████████████████████████████| 100%
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp / test.nrow
accuracy
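Merge the test data with the predictions and save the combined dataset as predict.csv
# Append the predicted label as an extra column of the test frame
out = test.cbind(predict['predict'])
h2o.download_csv(out, "predict.csv")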
'C:\\Users\\zzh\\Desktop\\dataMiningExperment\\exp4\\predict.csv'
3.2 Find the best parameters with Grid Search
Accuracy: 0.8708333333333333
ntrees: 10
max_depth: 10
rf_params = {'ntrees': list(range(30, 60)),
             'max_depth': list(range(10, 20))}
rf_grid = H2OGridSearch(model=H2ORandomForestEstimator,
                        hyper_params=rf_params)
rf_grid.train(x=train.names[0:-1],
              y='Catrgory',
              training_frame=train)
The output of this part is too long to show; the best parameters (ntrees and max_depth) are taken from it
rf_grid.show()
# Refit with the parameters reported at the start of this section (ntrees=10, max_depth=10)
model4 = H2ORandomForestEstimator(ntrees=10, max_depth=10)
model4.train(x=train.names[0:-1], y='Catrgory', training_frame=train)
predict = model4.predict(test[test.names[0:-1]])
tmp = predict[predict['predict'] == test['Catrgory']].nrow
accuracy = tmp / test.nrow
accuracy
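Hi everyone, I'm [爱做梦的子浩](https://blog.csdn.net/weixin_43124279), a third-year student in the Big Data Experimental Class at Northeastern University (东北大学) and still very much a beginner. I aspire to excellence and admire excellent people. I have received two summer internship offers so far. Feel free to get in touch 😂😂😂
My blog: [子浩的博客](https://blog.csdn.net/weixin_43124279)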
——
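Copyright notice: this is an original article by CSDN blogger 「爱做梦的子浩」, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.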