7.建立一个多类分类系统
7.建立一个多类分类系统
从规模化到特征提取、建模和评估,已经完成了简历分类系统的全部必要的步骤。现在将所有的东西组装在一起,应用到真实数据上以建立一个分类文本分类系统。对于此工作,将使用 scikit-learn 下载的 20 个新闻组数据集。这 20 个新闻组数据集包括分散在 20 个不同类别或主题的 18000 个新闻组帖子,这就构建了 20 类分类问题!请记住类的数量越多,尝试建立正确分类器就越复杂或者越困难。为防止模型因为文件头或者邮件地址而过拟合或泛化能力不强,体检的做法是从文档中去除文件头、文件尾和引用,因此需要确保考虑到了这一点。对于去除上述三项内容后的空文档或没用内容的文档,也将给予剔除,因为尝试从空文档中提取特征是毫无意义的。
开始下载所需的数据集以及为建立训练和测试数据集所用的函数:
from sklearn.datasets import fetch_20newsgroups ## 文档中使用的模块在高版本中会被剔除,根据提示替换模块解决问题 # from sklearn.cross_validation import train_test_split from sklearn.model_selection import train_test_split def get_data(): data = fetch_20newsgroups(subset = 'all' , shuffle = True , remove = ( 'headers' , 'footers' , 'quotes' )) return data def prepare_datasets(corpus, labels, test_data_proportion = 0.3 ): train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels, test_size = 0.33 , random_state = 42 ) return train_X, test_X, train_Y, test_Y def remove_empty_docs(corpus, labels): filtered_corpus = [] filtered_labels = [] for doc, label in zip (corpus, labels): if doc.strip(): filtered_corpus.append(doc) filtered_labels.append(label) return filtered_corpus, filtered_labels |
现在已经获得了数据,查看了数据集中分类的数量,使用下面的代码将数据集分为测试数据集和训练数据集。(下面代码执行下载数据集:)
In [ 20 ]: dataset = get_data() ...: print (dataset.target_names) ...: Downloading 20news dataset. This may take a few minutes. Downloading dataset from https: / / ndownloader.figshare.com / files / 5975967 ( 14 MB) [ 'alt.atheism' , 'comp.graphics' , 'comp.os.ms-windows.misc' , 'comp.sys.ibm.pc.hardware' , 'comp.sys.mac.hardware' , 'comp.windows.x' , 'misc.forsale' , 'rec.autos' , 'rec.motorcycles' , 'rec.sport.baseball' , 'rec.sport.hockey' , 'sci.crypt' , 'sci.electronics' , 'sci.med' , 'sci.space' , 'soc.religion.christian' , 'talk.politics.guns' , 'talk.politics.mideast' , 'talk.politics.misc' , 'talk.religion.misc' ] |
In [ 21 ]: corpus, labels = dataset.data, dataset.target ...: corpus, labels = remove_empty_docs(corpus, labels) ...: ...: print ( 'Sample document:' , corpus[ 10 ]) ...: print ( 'Class label:' ,labels[ 10 ]) ...: print ( 'Actual class label:' , dataset.target_names[labels[ 10 ]]) ...: ...: Sample document: the blood of the lamb. This will be a hard task, because most cultures used most animals for blood sacrifices. It has to be something related to our current post - modernism state. Hmm, what about used computers? Cheers, Kent Class label: 19 Actual class label: talk.religion.misc |
train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus, labels, test_data_proportion = 0.3 ) |
从上面的代码可以看到文档和标签的情况。每个文档拥有自己的标签类,这些标签是需要进行分类的 20 个主题之一。这些标签是数字形式的,如果需要,可以使用上面的代码容易地将它们映射回原来的类别名字。已经把数据分为训练数据集和测试数据集,测试数据集占总数据的 30%。将使用训练数据集建立模型,使用测试数据集测试模型的性能。下面的代码将使用前面建立的规范化模块对数据集进行规范化处理:
from normalization import normalize_corpus norm_train_corpus = normalize_corpus(train_corpus) norm_test_corpus = normalize_corpus(test_corpus) |
执行语句可能会耗费一段时间才能完成。
如果出现类似错误:
... RuntimeError: generator raised StopIteration |
请切换至 Python3.6 或更高版本
记住,语料库中每个文档进行规范化处理需要很多步骤,所以这将会耗费一些时间才能完成。完成文档规范化处理后,将使用前面建立的特征提取模块从文档中提取特征。将分别建立词袋模型、TF-IDF 模型、平均词向量模型和 TF-IDF 加权平均词向量模型,并比较它们的性能。
下面的代码基于不同技术提取必要的特征:
from feature_extractors import bow_extractor, tfidf_extractor from feature_extractors import averaged_word_vectorizer from feature_extractors import tfidf_weighted_averaged_word_vectorizer import nltk import gensim # bag of words features bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus) bow_test_features = bow_vectorizer.transform(norm_test_corpus) # tfidf features tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus) tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus) # tokenize documents tokenized_train = [nltk.word_tokenize(text) for text in norm_train_corpus] tokenized_test = [nltk.word_tokenize(text) for text in norm_test_corpus] # build word2vec model model = gensim.models.Word2Vec(tokenized_train, size = 500 , window = 100 , min_count = 30 , sample = 1e - 3 ) # averaged word vector features avg_wv_train_features = averaged_word_vectorizer(corpus = tokenized_train, model = model, num_features = 500 ) avg_wv_test_features = averaged_word_vectorizer(corpus = tokenized_test, model = model, num_features = 500 ) # tfidf weighted averaged word vector features vocab = tfidf_vectorizer.vocabulary_ tfidf_wv_train_features = tfidf_weighted_averaged_word_vectorizer(corpus = tokenized_train, tfidf_vectors = tfidf_train_features, tfidf_vocabulary = vocab, model = model, num_features = 500 ) tfidf_wv_test_features = tfidf_weighted_averaged_word_vectorizer(corpus = tokenized_test, tfidf_vectors = tfidf_test_features, tfidf_vocabulary = vocab, model = model, num_features = 500 ) |
使用上面的特征提取器从文本文档中提取了全部必要的特征之后,基于前面讨论的四个指标,定义一个函数用来苹果分类模型,函数如下面代码段所示:
from sklearn import metrics import numpy as np def get_metrics(true_labels, predicted_labels): print ( 'Accuracy:' , np. round ( metrics.accuracy_score(true_labels, predicted_labels), 2 )) print ( 'Precision:' , np. round ( metrics.precision_score(true_labels, predicted_labels, average = 'weighted' ), 2 )) print ( 'Recall:' , np. round ( metrics.recall_score(true_labels, predicted_labels, average = 'weighted' ), 2 )) print ( 'F1 Score:' , np. round ( metrics.f1_score(true_labels, predicted_labels, average = 'weighted' ), 2 )) |
现在定义一个函数使用机器学习算法和训练数据来训练模型,使用训练的模型在测试数据上执行预测,接着使用上面的函数苹果模型预测性能:
def train_predict_evaluate_model(classifier, train_features, train_labels, test_features, test_labels): # build model classifier.fit(train_features, train_labels) # predict using model predictions = classifier.predict(test_features) # evaluate model prediction performance get_metrics(true_labels = test_labels, predicted_labels = predictions) return predictions |
现在进入了 2 个机器学习算法,开始使用已经提取的特征建立模型。将使用前面提到的 scikit-learn 引入必要的分类算法,以节省花费在重写代码的时间和精力上:
from sklearn.naive_bayes import MultinomialNB from sklearn.linear_model import SGDClassifier mnb = MultinomialNB() svm = SGDClassifier(loss = 'hinge' , n_iter = 100 ) |
现在下面的代码将使用多项式朴素贝叶斯和支持向量机以及全部不同类型的特征进行模型训练、预测和评估:
# Multinomial Naive Bayes with bag of words features mnb_bow_predictions = train_predict_evaluate_model(classifier = mnb, train_features = bow_train_features, train_labels = train_labels, test_features = bow_test_features, test_labels = test_labels) |
Accuracy: 0.67 Precision: 0.72 Recall: 0.67 F1 Score: 0.65 |
# Support Vector Machine with bag of words features svm_bow_predictions = train_predict_evaluate_model(classifier = svm, train_features = bow_train_features, train_labels = train_labels, test_features = bow_test_features, test_labels = test_labels) |
Accuracy: 0.61 Precision: 0.67 Recall: 0.61 F1 Score: 0.62 |
# Multinomial Naive Bayes with tfidf features mnb_tfidf_predictions = train_predict_evaluate_model(classifier = mnb, train_features = tfidf_train_features, train_labels = train_labels, test_features = tfidf_test_features, test_labels = test_labels) |
Accuracy: 0.72 Precision: 0.78 Recall: 0.72 F1 Score: 0.7 |
# Support Vector Machine with tfidf features svm_tfidf_predictions = train_predict_evaluate_model(classifier = svm, train_features = tfidf_train_features, train_labels = train_labels, test_features = tfidf_test_features, test_labels = test_labels) |
Accuracy: 0.77 Precision: 0.77 Recall: 0.77 F1 Score: 0.77 |
# Support Vector Machine with averaged word vector features svm_avgwv_predictions = train_predict_evaluate_model(classifier = svm, train_features = avg_wv_train_features, train_labels = train_labels, test_features = avg_wv_test_features, test_labels = test_labels) |
Accuracy: 0.56 Precision: 0.58 Recall: 0.56 F1 Score: 0.56 |
# Support Vector Machine with tfidf weighted averaged word vector features svm_tfidfwv_predictions = train_predict_evaluate_model(classifier = svm, train_features = tfidf_wv_train_features, train_labels = train_labels, test_features = tfidf_wv_test_features, test_labels = test_labels) |
Accuracy: 0.53 Precision: 0.58 Recall: 0.53 F1 Score: 0.52 |
使用不同类型的特征建立了 6 个模型,使用测试数据评估了模型的性能。从上面的结果可以看到使用 TF-IDF 特征的 SVM 模型获得了最好的结果,准确率、精确率、召回率和 F1 score 均为 77%。可以建立 SVM TF-IDF 模型的混淆矩阵,以便了解模型性能不好的具体分类的情况:
import pandas as pd cm = metrics.confusion_matrix(test_labels, svm_tfidf_predictions) pd.DataFrame(cm, index = range ( 0 , 20 ), columns = range ( 0 , 20 )) |
Out[ 47 ]: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 156 3 0 1 1 0 2 3 4 1 4 4 2 4 5 34 3 7 7 22 1 1 224 9 7 8 14 8 0 2 1 0 2 5 4 4 1 4 0 3 0 2 1 20 221 18 9 18 8 1 0 0 0 3 5 2 2 2 1 1 2 0 3 1 11 25 223 9 4 9 2 1 1 1 2 6 3 1 0 1 0 0 0 4 0 4 7 15 228 6 5 2 3 1 0 3 9 3 3 1 1 0 1 0 5 0 21 18 1 2 272 0 1 1 0 0 0 4 3 1 0 0 1 0 0 6 0 2 7 11 12 1 270 10 3 2 1 1 10 1 4 0 2 1 1 0 7 1 5 2 2 2 3 4 246 19 1 3 2 10 3 2 0 4 3 3 1 8 3 1 0 4 2 2 5 27 252 3 4 2 1 4 1 3 2 2 4 0 9 1 1 1 0 2 3 5 3 6 278 12 2 1 1 2 4 2 0 1 0 10 0 0 0 0 0 0 1 3 2 4 282 1 2 1 4 1 0 1 1 0 11 3 5 3 3 1 2 2 2 2 3 0 259 6 2 0 1 5 2 5 0 12 1 6 6 15 7 2 13 10 8 4 4 2 212 3 5 1 1 1 0 1 13 2 4 0 1 3 4 3 0 2 0 1 1 7 267 4 2 3 0 4 0 14 0 5 3 0 2 4 2 5 4 1 2 0 8 3 264 2 4 1 3 1 15 11 1 0 0 1 1 0 0 4 1 3 2 1 7 5 292 4 4 2 4 16 4 1 0 0 0 4 2 1 7 2 2 11 3 2 4 2 227 3 13 3 17 6 0 1 0 1 3 0 2 3 2 4 6 1 3 1 6 5 259 10 2 18 9 1 2 1 0 1 2 1 5 3 3 7 0 9 6 4 33 7 165 3 19 21 5 0 1 0 2 3 3 7 2 1 1 0 11 3 57 21 7 3 65 |
从上表混淆矩阵上,可以看到很多类标签为 0 的文档被错误地分类到类标签 15 里面,同样对于类标签 18 的很多文档被错误地分类到类标签 16 里面。很多类标签 19 的文档被错误地分类到类型标签 15 里面。打印类型名字,可以看到如下输出:
In [ 48 ]: class_names = dataset.target_names ...: print (class_names[ 0 ], '->' , class_names[ 15 ]) ...: print (class_names[ 18 ], '->' , class_names[ 16 ]) ...: print (class_names[ 19 ], '->' , class_names[ 15 ]) ...: ...: alt.atheism - > soc.religion.christian talk.politics.misc - > talk.politics.guns talk.religion.misc - > soc.religion.christian |
从前面的输出可以看到错误分类与实际分类并没有显著的不同。Christian、religion 和 atheism 都是与商都和宗教存在有关的概念,可能会有相似的特征。杂项问题和强制都与政治有关,必然有相似的特征。可以使用下面的代码,进一步详细查看和分析被错误分类的问题:
import re num = 0 for document, label, predicted_label in zip (test_corpus, test_labels, svm_tfidf_predictions): if label = = 0 and predicted_label = = 15 : print ( 'Actual Label:' , class_names[label]) print ( 'Predicted Label:' , class_names[predicted_label]) print ( 'Document:-' ) print (re.sub( '\n' , ' ' , document)) print ("") num + = 1 if num = = 4 : break |
打印结果:
Actual Label: alt.atheism Predicted Label: soc.religion.christian Document: - I would like a list of Bible contadictions from those of you who dispite being free from Christianity are well versed in the Bible. Actual Label: alt.atheism Predicted Label: soc.religion.christian Document: - They spent quite a bit of time on the wording of the Constitution. They picked words whose meanings implied the intent. We have already looked in the dictionary to define the word. Isn 't this sufficient? But we were discussing it in relation to the death penalty. And, the Constitution need not define each of the words within. Anyone who doesn' t know what cruel is can look in the dictionary ( and we did). Actual Label: alt.atheism Predicted Label: soc.religion.christian Document: - Our Lord and Savior David Keresh has risen! He has been seen alive! Spread the word! - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Actual Label: alt.atheism Predicted Label: soc.religion.christian Document: - "This is your god" ( from John Carpenter's "They Live," natch) |
num = 0 for document, label, predicted_label in zip (test_corpus, test_labels, svm_tfidf_predictions): if label = = 18 and predicted_label = = 16 : print ( 'Actual Label:' , class_names[label]) print ( 'Predicted Label:' , class_names[predicted_label]) print ( 'Document:-' ) print (re.sub( '\n' , ' ' , document)) print () num + = 1 if num = = 4 : break |
打印结果:
Actual Label: talk.politics.misc Predicted Label: talk.politics.guns Document: - After the initial gun battle was over, they had 50 days to come out peacefully. They had their high priced lawyer, and judging by the posts here they had some public support. Can anyone come up with a rational explanation why the didn 't come out (even after they negotiated coming out after the radio sermon) that doesn' t include the Davidians wanting to commit suicide / murder / general mayhem? Actual Label: talk.politics.misc Predicted Label: talk.politics.guns Document: - Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire. Today 's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies. At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today' s paper quotes the government as saying, no, they didn 't have a license. Today' s paper reports that a number of the bodies were found with shoulder weapons next to them, as if they had been using them while dying - - which doesn 't sound like the sort of action I would expect from a suicide. Our government lies, as it tries to cover over its incompetence and negligence. Why should I believe the FBI' s claims about anything else , when we can see that they are LYING? This system of government is beyond reform. Actual Label: talk.politics.misc Predicted Label: talk.politics.guns Document: - Well, for one thing most, if not all the Dividians (depending on whether they could show they acted in self - defense and there were no illegal weapons), could have gone on with their life as they were living it. No one was forcing them to give up their religion or even their legal weapons. The Dividians had survived a change in leadership before so even if Koresch himself would have been convicted and sent to jail, they still could have carried on. I don 't think the Dividians were insane, but I don' t see a reason for mass suicide ( if the fire was intentional set by some of the Dividians.) We also don 't know that, if the fire was intentionally set from inside, was it a generally know plan or was this something only an inner circle knew about, or was it something two or three felt they had to do with or without Koresch' s knowledge / blessing, etc.? I don't know much about Masada. Were some people throwing others over? Did mothers jump over with their babies in their arms? Actual Label: talk.politics.misc Predicted Label: talk.politics.guns Document: - rja@mahogany126.cray.com (Russ Anderson) writes... The fact is that Koresh and his followers involved themselves in a gun battle to control the Mt Carmel complex . That is not in dispute. From what I remember of the trial, the authories couldn 't reasonably establish who fired first, the big reason behind the aquittal. _____ _____ \\\\\\/ ___/___________________ Mitchell S Todd \\\\/ / _____/__________________________ ________________ \\/ / mst4298@zeus._____/.' . '.' . '.' . '.' . '.' . '.' . '_' _ '_/ \_____ \__ / / tamu.edu _____/.' . '.' . '.' . '.' . '.' . '.' . '.' _ '_/ \__________\__ / / _____/_' _ '_' _ '_' _ '_' _ '_' _ '_' _ '_' _'_ / \_ / / __________ / \ / ____ / \\\\\\ \\\\\\ |
可以看到是如何分析和查看错误分类的文档的,然后回到前面步骤,调整优化特征提取方法,通过删除特征的单词或调整单词权重来减少或增加影响程度。
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】凌霞软件回馈社区,博客园 & 1Panel & Halo 联合会员上线
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】博客园社区专享云产品让利特惠,阿里云新客6.5折上折
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步