[Python] Decision Tree (DecisionTreeClassifier): Northeastern University Data Mining Lab 3
1. Train a decision tree on train_feature.csv, predict on test_feature.csv (practice hyperparameter tuning), and compute the prediction accuracy. (Because the training data is imbalanced, balance it first: keep all positive samples, and draw n times the number of positives from the original negatives.) Note: accuracy = (predicted as downloaded in the test set) & (actually downloaded in the test set) / number of actually-downloaded records in the test set.
import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
import time,datetime
train_df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\train_feature.csv")
train_df.head()
| | ip | app | device | os | channel | is_attributed | day | hour | minute | ip_count | app_count | device_count | os_count | channel_count | hour_count | minute_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83230 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 32 | 938 | 774123 | 6527713 | 1541988 | 101195 | 48 | 110457 |
| 1 | 17357 | 3 | 1 | 19 | 379 | 0 | 2017-11-06 | 14 | 33 | 677 | 774123 | 6527713 | 1644220 | 101195 | 48 | 112948 |
| 2 | 35810 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 34 | 351 | 774123 | 6527713 | 1541988 | 101195 | 48 | 112532 |
| 3 | 45745 | 14 | 1 | 13 | 478 | 0 | 2017-11-06 | 14 | 34 | 7786 | 316214 | 6527713 | 1541988 | 11355 | 48 | 112532 |
| 4 | 161007 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 35 | 132 | 774123 | 6527713 | 1541988 | 101195 | 48 | 115570 |
test_df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\test_feature.csv")
test_df.head()
| | click_id | ip | app | device | os | channel | is_attributed | day | hour | minute | ip_count | app_count | device_count | os_count | channel_count | hour_count | minute_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19870 | 2 | 1 | 13 | 435 | 0 | 2017-11-06 | 23 | 1 | 99 | 308059 | 2853433 | 657790 | 42678 | 2308568 | 68675 |
| 1 | 1 | 50314 | 15 | 1 | 17 | 265 | 0 | 2017-11-06 | 23 | 1 | 233 | 307505 | 2853433 | 153419 | 68057 | 2308568 | 68675 |
| 2 | 2 | 183513 | 15 | 1 | 13 | 153 | 0 | 2017-11-06 | 23 | 1 | 105 | 307505 | 2853433 | 657790 | 104935 | 2308568 | 68675 |
| 3 | 3 | 35731 | 12 | 1 | 19 | 178 | 0 | 2017-11-06 | 23 | 1 | 550 | 348786 | 2853433 | 765928 | 89744 | 2308568 | 68675 |
| 4 | 4 | 186444 | 12 | 1 | 3 | 265 | 0 | 2017-11-06 | 23 | 1 | 16 | 348786 | 2853433 | 45955 | 68057 | 2308568 | 68675 |
Because the training data is imbalanced, balance it: keep all positive samples and draw n times the number of positives from the original negatives.
train_df["is_attributed"].value_counts()
0 6986725
1 13275
Name: is_attributed, dtype: int64
tmp_is1 = train_df[train_df['is_attributed'] == 1]  # 13275 positives
tmp_is0 = train_df[train_df['is_attributed'] == 0]  # 6986725 negatives
tmp_is0 = tmp_is0.sample(n=tmp_is1.shape[0] * 5)    # sample 5x as many negatives as positives
train_df = pd.concat([tmp_is1, tmp_is0])            # merge (DataFrame.append is deprecated)
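The balancing step above can be generalized into a small helper. This is a sketch: the `balance` function and its `n`/`seed` parameters are illustrative names, not part of the original code, and the demo frame is synthetic.

```python
import pandas as pd

def balance(df, label_col="is_attributed", n=5, seed=42):
    """Keep every positive row; sample n times as many negatives."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0].sample(n=len(pos) * n, random_state=seed)
    return pd.concat([pos, neg])

# Toy demo on a synthetic imbalanced frame (2 positives, 100 negatives)
demo = pd.DataFrame({"is_attributed": [1] * 2 + [0] * 100})
balanced = balance(demo, n=5)
print(balanced["is_attributed"].value_counts().to_dict())  # {0: 10, 1: 2}
```

Fixing `random_state` makes the sampled negatives, and hence the reported scores, reproducible; the original code resamples differently on every run.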
Drop the 'day' column
print(train_df["day"].value_counts())  # every training row has the same day, so the column carries no information; drop it
print(test_df["day"].value_counts())
2017-11-06 79650
Name: day, dtype: int64
2017-11-06 2308568
2017-11-07 691432
Name: day, dtype: int64
train_df1=train_df.drop(['day'],axis=1)
test_df1=test_df.drop(['day'],axis=1)
test_df1=test_df1.drop(['click_id'],axis=1)
# Split into features and labels: every column except 'is_attributed' is used as a feature.
y_train=train_df1['is_attributed'].values  # 1-D label array (avoids sklearn's column-vector warning)
y_test=test_df1['is_attributed'].values
x_train=train_df1.drop(['is_attributed'],axis=1)
x_test=test_df1.drop(['is_attributed'],axis=1)
# Initialize the model; max_depth limits the maximum tree depth.
# criterion can be "gini" (Gini impurity) or "entropy" (information gain).
# The default "gini", i.e. the CART algorithm, is usually fine, unless you prefer
# the ID3/C4.5-style criterion for choosing the best split feature.
clf = DecisionTreeClassifier()
# Train the model
clf.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
print("Training score:", clf.score(x_train, y_train))
print("Test score:", clf.score(x_test, y_test))
Training score: 0.9996359070935342
Test score: 0.8894056666666666
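The task statement asks for hyperparameter tuning, and `GridSearchCV` is imported above but never used. Here is a minimal tuning sketch on synthetic stand-in data; the grid values are illustrative assumptions, and with the real data you would pass `x_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with x_train / y_train for the real task
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

param_grid = {
    "max_depth": [3, 5, 10, None],     # cap tree depth to curb overfitting
    "min_samples_leaf": [1, 5, 20],    # larger leaves give smoother trees
    "criterion": ["gini", "entropy"],  # impurity measure
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

The near-perfect training score above (0.9996) against the lower test score suggests overfitting, which capping `max_depth` or raising `min_samples_leaf` typically mitigates.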
predict = clf.predict(x_test)  # predict on the test set
submission = pd.DataFrame({
    'click_id': test_df['click_id'],
    'is_attributed': predict
})
# submission.to_csv('submission.csv',index=False)
Note: accuracy = (predicted as downloaded in the test set) & (actually downloaded in the test set) / number of actually-downloaded records in the test set.
print("Accuracy:", sum((predict == 1) & (test_df.is_attributed == 1)) / sum(test_df.is_attributed == 1))
Accuracy: 0.7458654906284454
# A clumsier way to compute the same metric
# denominator = test_df[test_df.is_attributed==1].shape[0]  # denominator: actual positives
# denominator
# test_df['is_attributed'] = test_df['is_attributed'].replace(0, 2)
# submission['is_attributed'] = submission['is_attributed'].replace(0, 3)
# molecule = (test_df['is_attributed'] == submission['is_attributed']).sum()  # numerator: true positives
# molecule
# Precision = molecule / denominator
# print(Precision)
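The metric defined here (true positives divided by actual positives) is recall for the positive class rather than precision or overall accuracy, and `sklearn.metrics.recall_score` computes the same quantity. A small check on toy labels:

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([1, 1, 1, 0, 0, 1])  # 4 actual positives
y_pred = np.array([1, 0, 1, 0, 1, 1])  # 3 of them predicted positive

manual = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
print(manual)                        # 0.75
print(recall_score(y_true, y_pred))  # 0.75
```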
Hi everyone, I'm [爱做梦的子浩](https://blog.csdn.net/weixin_43124279), a junior in the big-data experimental class at Northeastern University. I aspire to excellence and admire excellent people, have already received two summer offers, and welcome anyone to reach out and chat 😂😂😂
Here is my blog: [https://blog.csdn.net/weixin_43124279]
——
Copyright notice: this is an original article by CSDN blogger 「爱做梦的子浩」, licensed under CC 4.0 BY-SA; please include a link to the original and this notice when reposting.