Secondary data-processing workflow for the paper
Step 1:
Based on the program.list file in the conf directory, create a folder named after each program under raw_data.
Based on the program_keywords file in the conf directory, create the program's filter-word file inside each program folder.
Step 2:
Filter sina_weibo.data with each program's filter words, applying the program's keywords one after another.
This produces the corresponding program.data file. The extracted fields are:
weibo id ($2)  user id ($3)  creation time ($5)  reposts ($11)  comments ($12)  likes ($13)  text ($6)
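For illustration, here is a minimal Python sketch of how one such tab-separated, seven-column program.data record could be parsed. The field names and the sample line are made up for the example; only the column order (matching the awk extraction $2 $3 $5 $11 $12 $13 $6) comes from this document.

# -*- coding: utf8 -*-
# Sketch: parse one 7-column, tab-separated program.data record into named fields.
# Field names are illustrative; the column order follows the extraction described above.
FIELDS=['weibo_id','user_id','created_at','reposts','comments','likes','text']

def parse_record(line):
    parts=line.rstrip('\n').split('\t')
    if len(parts)!=len(FIELDS):
        return None  # skip malformed rows
    return dict(zip(FIELDS,parts))

if __name__=='__main__':
    # made-up sample record, in the same column order
    sample='\t'.join(['123456','654321','2014-02-01 20:00:00','12','3','45','example tweet text'])
    print(parse_record(sample))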
The complete script covering the two steps above is part of the full listing given after Step 5.
Step 3:
Tweets filtered by the program title alone are saved in the .title files (steps 2 and 3 could in fact be merged).
Step 4: draw annotation samples. For each program, shuffle the keyword-filtered .uniq file and the title-filtered .title.uniq file, take 1000 lines from each, and save them as $program.sample and $program.title.sample.
Step 5: copy the sample files above to the corresponding .annotate files, in preparation for possible manual annotation later.
The code for the above workflow is:
#!/bin/sh
root_dir=/home/minelab/liweibo
source_file=/home/minelab/cctv2014/data_warehouse/sina_weibo.data
conf_dir=$root_dir/conf
raw_dir=$root_dir/raw_data

# Under raw_data, create one directory per program and the program's keyword file inside it
echo "make the program dir..."
while read line
do
    rm -rf $raw_dir/$line
    mkdir $raw_dir/$line
    cat $conf_dir/program_keywords | grep $line | awk -F'\t' '{for(i=1;i<=NF;i++) print $i}' > $raw_dir/$line/$line.filterwords
    echo $line" mkdir and get filter words is done!"
done < $conf_dir/program.list

echo 'get the candidate tweet for each program filtering by the keywords...'
program_list=`ls $raw_dir`
for program in $program_list
do
    rm -rf $raw_dir/$program/$program.data
    rm -rf $raw_dir/$program/$program.uniq
    while read line
    do
        cat $source_file | grep $line | awk -F'\t' '{print $2"\t"$3"\t"$5"\t"$11"\t"$12"\t"$13"\t"$6}' >> $raw_dir/$program/$program.data
    done < $raw_dir/$program/$program.filterwords
    echo $program "filtering is done!"
    # strip t.cn short links, then de-duplicate by tweet text
    sed -i '1,$s/http:\/\/t\.cn\/[a-zA-Z0-9]\{4,9\}//g' $raw_dir/$program/$program.data
    echo $program "remove url is done..."
    # sort by the 7th (text) column; uniq -f 6 skips the first six fields when comparing
    cat $raw_dir/$program/$program.data | sort -t ' ' -k 7 | uniq -f 6 > $raw_dir/$program/$program.uniq
    echo $program "uniq is done ..."
done
echo "filter tweet by all words is done..."

echo 'get the candidate tweet for each program filtering by the title...'
program_list=`ls $raw_dir`
for program in $program_list
do
    rm -rf $raw_dir/$program/$program.title
    rm -rf $raw_dir/$program/$program.title.uniq
    cat $source_file | grep $program | awk -F'\t' '{print $2"\t"$3"\t"$5"\t"$11"\t"$12"\t"$13"\t"$6}' > $raw_dir/$program/$program.title
    echo $program "filtering is done!"
    # strip t.cn short links, then de-duplicate by tweet text
    sed -i '1,$s/http:\/\/t\.cn\/[a-zA-Z0-9]\{4,9\}//g' $raw_dir/$program/$program.title
    echo $program "remove url is done..."
    cat $raw_dir/$program/$program.title | sort -t ' ' -k 7 | uniq -f 6 > $raw_dir/$program/$program.title.uniq
    echo $program "uniq is done ..."
done
echo "preData is done..."

echo "sample is beginning ..."
program_list=`ls $raw_dir`
for program in $program_list
do
    rm -rf $raw_dir/$program/$program.sample
    rm -rf $raw_dir/$program/$program.title.sample
    cat $raw_dir/$program/$program.uniq | shuf | head -n 1000 > $raw_dir/$program/$program.sample
    cat $raw_dir/$program/$program.title.uniq | shuf | head -n 1000 > $raw_dir/$program/$program.title.sample
    echo $program "sampling is done..."
done

echo "statistics start..."
program_list=`ls $raw_dir`
for program in $program_list
do
    rm -rf $raw_dir/$program/$program.statistic
    wc -l $raw_dir/$program/$program.data >> $raw_dir/$program/$program.statistic
    wc -l $raw_dir/$program/$program.uniq >> $raw_dir/$program/$program.statistic
    wc -l $raw_dir/$program/$program.title >> $raw_dir/$program/$program.statistic
    wc -l $raw_dir/$program/$program.title.uniq >> $raw_dir/$program/$program.statistic
    wc -l $raw_dir/$program/$program.sample >> $raw_dir/$program/$program.statistic
    wc -l $raw_dir/$program/$program.title.sample >> $raw_dir/$program/$program.statistic
    echo $program "statistic is done..."
done

echo "copy for annotate ..."
program_list=`ls $raw_dir`
for program in $program_list
do
    rm -rf $raw_dir/$program/$program.sample.annotate
    rm -rf $raw_dir/$program/$program.title.sample.annotate
    cp $raw_dir/$program/$program.sample $raw_dir/$program/$program.sample.annotate
    cp $raw_dir/$program/$program.title.sample $raw_dir/$program/$program.title.sample.annotate
    echo $program "copy for annotate is done ..."
done
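One detail of the script worth spelling out: sort -k 7 groups rows by the seventh column (the tweet text) and uniq -f 6 skips the first six fields when comparing, so duplicates are decided by the text alone. Below is a rough Python equivalent of that de-duplication step, written as a sketch with placeholder file names; it keeps the first occurrence of each text instead of relying on sorting, but the effect is the same (one row per distinct tweet text).

# Sketch of the de-duplication step: keep one row per distinct tweet text (7th column).
# 'program.data' and 'program.uniq' are placeholder file names.
def dedup_by_text(in_path='program.data',out_path='program.uniq'):
    seen=set()
    with open(in_path) as fin, open(out_path,'w') as fout:
        for line in fin:
            cols=line.rstrip('\n').split('\t')
            if len(cols)<7:
                continue
            text=cols[6]       # 7th column: tweet content
            if text in seen:   # drop rows whose text has already appeared
                continue
            seen.add(text)
            fout.write(line)

if __name__=='__main__':
    dedup_by_text()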
When the training and test sets come from different programs, the per-program metrics are averaged.
The Python scripts for the baseline of the experiments above (a small file-traversal utility, followed by the training and evaluation script):
#!/usr/bin/python
# -*- coding: utf8 -*-
import os
import os.path
import sys

root_dir='/media/新加卷__/小论文实验/data/liweibo'

def init():
    reload(sys)
    sys.setdefaultencoding('utf8')

# Recursively collect the files under dir_name whose names end with one of the suffixes in suffix_list
def traverseFile(dir_name,suffix_list,result_list,recursive=True):
    init()
    # print 'loading dir: '+dir_name
    files=os.listdir(dir_name)
    for suffix in suffix_list:
        for file_name in files:
            full_name=dir_name+'/'+file_name
            if os.path.isdir(full_name) and recursive:
                traverseFile(full_name,suffix_list,result_list,recursive)
            else:
                if full_name.endswith(suffix):
                    result_list.append(full_name)
    return result_list

def printlist(l):
    for i in range(len(l)):
        print l[i]

if __name__=="__main__":
    result_list=list()
    traverseFile(root_dir,['.fenci'],result_list)
    for result in result_list:
        print result
#!/usr/bin/python
# -*- coding: utf8 -*-
import numpy as np
import random
import myUtil
from sklearn import cross_validation
from sklearn import svm
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import pylab as pl

def loadCorpusDataFromFile(in_file_name):
    label_list=list()
    corpus=list()
    with open(in_file_name) as in_file:
        for line in in_file:
            line_arr=line.strip().split('\t')
            if len(line_arr)<3:
                continue
            label_list.append(int(line_arr[0]))
            corpus.append(line_arr[2])
    return (corpus,label_list)

def preDataByWordBag(in_file_name_list):
    v=CountVectorizer(min_df=1)
    corpus=list()
    label_list=list()
    for in_file_name in in_file_name_list:
        (cur_corpus,cur_label_list)=loadCorpusDataFromFile(in_file_name)
        corpus.extend(cur_corpus)
        label_list.extend(cur_label_list)
    data_list=v.fit_transform(corpus)
    label_list=np.array(label_list)
    return (data_list,label_list)

def preDataByTfidf(in_file_name_list):
    v=TfidfVectorizer(min_df=1)
    corpus=list()
    label_list=list()
    for in_file_name in in_file_name_list:
        (cur_corpus,cur_label_list)=loadCorpusDataFromFile(in_file_name)
        corpus.extend(cur_corpus)
        label_list.extend(cur_label_list)
    data_list=v.fit_transform(corpus)
    label_list=np.array(label_list)
    return (data_list,label_list)

'''
The training and test sets are specified explicitly, so no cross-validation is needed.
data_train: training set
data_test: test set
classifier: the classifier
'''
def trainModelAllocateTestData(data_train,data_test,classifier):
    print "start to trainModel..."
    x_train=data_train[0]
    y_train=data_train[1]
    x_test=data_test[0]
    y_test=data_test[1]
    n_samples,n_features=x_train.shape
    print "n_samples:"+str(n_samples)+" n_features:"+str(n_features)
    classifier.fit(x_train,y_train)
    y_true,y_pred=y_test,classifier.predict(x_test)
    precision=metrics.precision_score(y_true,y_pred)
    recall=metrics.recall_score(y_true,y_pred)
    accuracy=metrics.accuracy_score(y_true,y_pred)
    # accuracy=classifier.score(x_test,y_true)
    f=metrics.fbeta_score(y_true,y_pred,beta=1)
    probas_=classifier.predict_proba(x_test)
    fpr,tpr,thresholds=metrics.roc_curve(y_test,probas_[:,1])
    roc_auc=metrics.auc(fpr,tpr)
    print("precision:%0.2f,recall:%0.2f,f:%0.2f,accuracy:%0.2f,roc_auc:%0.2f" % (precision,recall,f,accuracy,roc_auc))
    # plot ROC curve
    pl.clf()
    pl.plot(fpr,tpr,label='ROC curve (area = %0.2f)' % roc_auc)
    pl.plot([0,1],[0,1],'k--')
    pl.xlim([0.0,1.0])
    pl.ylim([0.0,1.0])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('receiver operating characteristic example')
    pl.legend(loc='lower right')
    pl.show()
    return (precision,recall,f,accuracy,roc_auc)

def trainModel(data,classifier,n_folds=5):
    print "start to trainModel..."
    x=data[0]
    y=data[1]
    # shuffle samples
    n_samples,n_features=x.shape
    print "n_samples:"+str(n_samples)+" n_features:"+str(n_features)
    p=range(n_samples)
    random.seed(0)
    random.shuffle(p)
    x,y=x[p],y[p]
    # cross-validation
    cv=cross_validation.KFold(len(y),n_folds=n_folds)
    mean_tpr=0.0
    mean_fpr=np.linspace(0,1,100)
    mean_recall=0.0
    mean_accuracy=0.0
    mean_f=0.0
    mean_precision=0.0
    for i,(train,test) in enumerate(cv):
        print "the "+str(i)+" times validation..."
        classifier.fit(x[train],y[train])
        y_true,y_pred=y[test],classifier.predict(x[test])
        mean_precision+=metrics.precision_score(y_true,y_pred)
        mean_recall+=metrics.recall_score(y_true,y_pred)
        # mean_accuracy+=metrics.accuracy_score(y_true,y_pred)
        mean_accuracy+=classifier.score(x[test],y_true)
        mean_f+=metrics.fbeta_score(y_true,y_pred,beta=1)
        probas_=classifier.predict_proba(x[test])
        fpr,tpr,thresholds=metrics.roc_curve(y[test],probas_[:,1])
        mean_tpr+=np.interp(mean_fpr,fpr,tpr)
        mean_tpr[0]=0.0
        roc_auc=metrics.auc(fpr,tpr)
        pl.plot(fpr,tpr,lw=1,label='ROC fold %d (area=%0.2f)'%(i,roc_auc))
    pl.plot([0,1],[0,1],'--',color=(0.6,0.6,0.6),label='luck')
    mean_precision/=len(cv)
    mean_recall/=len(cv)
    mean_f/=len(cv)
    mean_accuracy/=len(cv)
    mean_tpr/=len(cv)
    mean_tpr[-1]=1.0
    mean_auc=metrics.auc(mean_fpr,mean_tpr)
    print("mean_precision:%0.2f,mean_recall:%0.2f,mean_f:%0.2f,mean_accuracy:%0.2f,mean_auc:%0.2f " % (mean_precision,mean_recall,mean_f,mean_accuracy,mean_auc))
    pl.plot(mean_fpr,mean_tpr,'k--',label='Mean ROC (area=%0.2f)'% mean_auc,lw=2)
    pl.xlim([-0.05,1.05])
    pl.ylim([-0.05,1.05])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.title('ROC')
    pl.legend(loc="lower right")
    pl.show()

def removeOneFeatureThenTrain(data,clf):
    x=data[0]
    y=data[1]
    n_samples,n_features=x.shape
    for i in range(n_features):
        print 'remove '+str(i+1)+' feature...'
        data_one=x[:,0:i]
        data_two=x[:,(i+1):n_features]
        data_leave=np.column_stack((data_one,data_two))
        trainModel((data_leave,y),clf)

def chooseSomeFeaturesThenTrain(data,clf,choose_index):
    x=data[0]
    y=data[1]
    (n_samples,n_features)=x.shape
    result_data=np.zeros(n_samples).reshape(n_samples,1)
    for i in choose_index:
        if i<1 or i>n_features:
            print 'error feature_index'
            return
        choose_column=x[:,(i-1)].reshape(n_samples,1)
        result_data=np.column_stack((result_data,choose_column))
    result_data=(result_data[:,1:],y)
    trainModel(result_data,clf)

def main():
    # classify with an SVM
    clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    # weight features with my own extraction strategy
    # print "using my own weight strategy..."
    # data=preData()
    # trainModel(data,clf)
    # weight features with bag-of-words
    # print "using wordBag strategy..."
    # data=preDataByWordBag()
    # trainModel(data,clf)
    # weight features with tf-idf
    # print "using tfidf strategy..."
    # data=preDataByTfidf()
    # trainModel(data,clf)
    # use the library's built-in cross-validation utility
    # data_list=data[0]
    # label_list=data[1]
    # scores=cross_validation.cross_val_score(clf,data_list,label_list,cv=5)
    # print scores
    # print("Accuracy:%0.2f(+/-%0.2f)"%(scores.mean(),scores.std()**2))
    # remove one feature at a time and evaluate
    print "begin to remove one feature at one time..."
    # data=preData()
    # removeOneFeatureThenTrain(data,clf)
    # evaluate selected feature combinations, one feature at a time
    print "begin to choose some features.."
    data=preData()  # preData (loader for the hand-crafted features) is defined elsewhere and not shown here
    n_samples,n_features=data[0].shape
    for i in range(1,n_features+1):
        chooseSomeFeaturesThenTrain(data,clf,[i])

root_dir='/media/新加卷_/小论文实验/data/liweibo/raw_data'

'''
Load all word-segmented files, using the .fenci suffix as the filter.
'''
def loadAllFenciFile():
    file_list=list()
    # collect the segmented files that feed the classifier
    myUtil.traverseFile(root_dir,['.fenci'],file_list)
    return file_list

'''
Use all the data as the training set.
'''
def testAllFile():
    file_list=loadAllFenciFile()
    clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    data=preDataByWordBag(file_list)
    trainModel(data,clf)

'''
Train on each program separately.
'''
def testEachFile():
    clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    file_list=loadAllFenciFile()
    for i in range(len(file_list)):
        if i==1:
            continue
        data=preDataByWordBag([file_list[i]])
        trainModel(data,clf)

def trainBySomeTestByOther():
    clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    ambiguity_list=loadAllFenciFile()
    mean_precision=0.0
    mean_recall=0.0
    mean_f=0.0
    mean_accuracy=0.0
    mean_auc=0.0
    program_num=len(ambiguity_list)
    for i in range(program_num):
        test_file=ambiguity_list[i]
        # move the held-out program to the end so its rows sit at the bottom of the combined matrix
        ambiguity_list.remove(test_file)
        ambiguity_list.append(test_file)
        print 'test_file:'
        print test_file
        print 'train_file:'
        myUtil.printlist(ambiguity_list)
        test_line=len(loadCorpusDataFromFile(test_file)[1])
        data_all=preDataByWordBag(ambiguity_list)
        data_train=(data_all[0][0:-test_line],data_all[1][0:-test_line])
        data_test=(data_all[0][-test_line:],data_all[1][-test_line:])
        (precision,recall,f,accuracy,roc_auc)=trainModelAllocateTestData(data_train,data_test,clf)
        mean_precision+=precision
        mean_recall+=recall
        mean_f+=f
        mean_accuracy+=accuracy
        mean_auc+=roc_auc
        ambiguity_list=loadAllFenciFile()
    mean_precision/=program_num
    mean_recall/=program_num
    mean_f/=program_num
    mean_accuracy/=program_num
    mean_auc/=program_num
    print("the average result of train by some test by other is:")
    print("mean_precision:%0.2f,mean_recall:%0.2f,mean_f:%0.2f,mean_accuracy:%0.2f,mean_auc:%0.2f " % (mean_precision,mean_recall,mean_f,mean_accuracy,mean_auc))

# ------------------------- training with my own extracted features -------------------------
def loadMyDataForSingle(inFilePath):
    label_list=list()
    data_list=list()
    with open(inFilePath) as inFile:
        for line in inFile:
            lineArr=line.strip().split('\t')
            if len(lineArr)!=8:
                continue
            label_list.append(int(lineArr[0]))
            data_list.append([float(lineArr[1]),float(lineArr[2]),float(lineArr[3]),float(lineArr[4]),float(lineArr[5]),float(lineArr[6]),float(lineArr[7])])
    return (data_list,label_list)

def loadMyDataForMany(inFilePathList):
    label_list=list()
    data_list=list()
    for inFilePath in inFilePathList:
        result=loadMyDataForSingle(inFilePath)
        label_list.extend(result[1])
        data_list.extend(result[0])
    return (np.array(data_list),np.array(label_list))

def loadAllMyFile():
    file_list=list()
    myUtil.traverseFile(root_dir,['.result'],file_list)
    return file_list

def trainAllByMine():
    file_list=loadAllMyFile()
    data=loadMyDataForMany(file_list)
    clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    trainModel(data,clf)

def trainSomeTestOtherByMine():
    clf=svm.SVC(kernel='linear',C=1,probability=True,random_state=0)
    file_list=loadAllMyFile()
    mean_precision=0.0
    mean_recall=0.0
    mean_f=0.0
    mean_accuracy=0.0
    mean_auc=0.0
    program_num=len(file_list)
    for i in range(program_num):
        test_file=file_list[i]
        file_list.remove(test_file)
        print 'test_file:'
        print test_file
        print 'train_file:'
        myUtil.printlist(file_list)
        # note: as written, the single held-out file is used for training and the remaining
        # files for testing, which is the opposite of trainBySomeTestByOther above
        data_train=loadMyDataForMany([test_file])
        data_test=loadMyDataForMany(file_list)
        (precision,recall,f,accuracy,roc_auc)=trainModelAllocateTestData(data_train,data_test,clf)
        mean_precision+=precision
        mean_recall+=recall
        mean_f+=f
        mean_accuracy+=accuracy
        mean_auc+=roc_auc
        file_list=loadAllMyFile()
    mean_precision/=program_num
    mean_recall/=program_num
    mean_f/=program_num
    mean_accuracy/=program_num
    mean_auc/=program_num
    print("the average result of train by some test by other is:")
    # print the averaged metrics (mirrors trainBySomeTestByOther)
    print("mean_precision:%0.2f,mean_recall:%0.2f,mean_f:%0.2f,mean_accuracy:%0.2f,mean_auc:%0.2f " % (mean_precision,mean_recall,mean_f,mean_accuracy,mean_auc))

if __name__=='__main__':
    # all programs together, bag-of-words features
    # testAllFile()
    # each program separately, bag-of-words features
    # testEachFile()
    # bag-of-words on the highly ambiguous programs, with training and test sets from different programs
    # trainBySomeTestByOther()
    # my own features, training and test sets mixed together
    # trainAllByMine()
    # my own features, training and test sets from different programs
    trainSomeTestOtherByMine()
Using the simplest bag-of-words model as the baseline, the results are as follows.

When the training and test sets come from the same source, with 5-fold cross-validation:
mean_precision:0.92,mean_recall:0.92,mean_f:0.92,mean_accuracy:0.90,mean_auc:0.96

When the test set is 团圆饭:
precision:0.49,recall:0.39,f:0.43,accuracy:0.62,roc_auc:0.62
When the test set is 我就这么个人:
precision:0.95,recall:0.81,f:0.88,accuracy:0.83,roc_auc:0.91
When the test set is 我的要求不算高:
precision:0.92,recall:0.31,f:0.47,accuracy:0.54,roc_auc:0.79
When the test set is 扶不扶:
precision:0.93,recall:0.87,f:0.90,accuracy:0.83,roc_auc:0.85
When the test set is 时间都去哪儿:
precision:0.84,recall:0.35,f:0.49,accuracy:0.58,roc_auc:0.69
When the test set is 说你什么好:
precision:0.59,recall:0.79,f:0.67,accuracy:0.66,roc_auc:0.76
Averaging the runs above gives:
mean_precision:0.79,mean_recall:0.59,mean_f:0.64,mean_accuracy:0.68,mean_auc:0.77

Using my own weighting method, with the training and test sets mixed, the result is:
mean_precision:0.94,mean_recall:0.81,mean_f:0.87,mean_accuracy:0.85,mean_auc:0.88

With the training and test sets kept separate, the results are:
When the test set is 团圆饭:
precision:0.98,recall:0.71,f:0.83,accuracy:0.80,roc_auc:0.85
When the test set is 我就这么个人:
precision:0.90,recall:0.73,f:0.81,accuracy:0.80,roc_auc:0.81
When the test set is 我的要求不算高:
precision:0.74,recall:0.88,f:0.80,accuracy:0.74,roc_auc:0.86
When the test set is 扶不扶 (this one clearly drags the average down; worth trying with it removed):
precision:0.55,recall:1.00,f:0.71,accuracy:0.55,roc_auc:0.54
When the test set is 时间都去哪儿:
precision:0.74,recall:0.97,f:0.84,accuracy:0.77,roc_auc:0.91
When the test set is 说你什么好:
precision:0.97,recall:0.75,f:0.85,accuracy:0.83,roc_auc:0.86
The average result is:
precision:0.97,recall:0.75,f:0.85,accuracy:0.83,roc_auc:0.86
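For reference, the averaged line for the bag-of-words runs can be recomputed directly from the six per-program rows above; a small Python sketch of that arithmetic (the tuples simply copy the numbers listed above):

# Recompute the macro-average of the six bag-of-words runs listed above.
# Each tuple is (precision, recall, f, accuracy, roc_auc) for one held-out program.
runs=[
    (0.49,0.39,0.43,0.62,0.62),  # 团圆饭
    (0.95,0.81,0.88,0.83,0.91),  # 我就这么个人
    (0.92,0.31,0.47,0.54,0.79),  # 我的要求不算高
    (0.93,0.87,0.90,0.83,0.85),  # 扶不扶
    (0.84,0.35,0.49,0.58,0.69),  # 时间都去哪儿
    (0.59,0.79,0.67,0.66,0.76),  # 说你什么好
]
names=('mean_precision','mean_recall','mean_f','mean_accuracy','mean_auc')
means=[sum(r[i] for r in runs)/len(runs) for i in range(len(names))]
print(','.join('%s:%0.2f' % (n,m) for n,m in zip(names,means)))
# prints: mean_precision:0.79,mean_recall:0.59,mean_f:0.64,mean_accuracy:0.68,mean_auc:0.77

The printed line matches the bag-of-words average quoted above; the same arithmetic applies to the per-program runs of the other weighting method.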