7.建立一个多类分类系统

7.建立一个多类分类系统

从规模化到特征提取、建模和评估,已经完成了简历分类系统的全部必要的步骤。现在将所有的东西组装在一起,应用到真实数据上以建立一个分类文本分类系统。对于此工作,将使用 scikit-learn 下载的 20 个新闻组数据集。这 20 个新闻组数据集包括分散在 20 个不同类别或主题的 18000 个新闻组帖子,这就构建了 20 类分类问题!请记住类的数量越多,尝试建立正确分类器就越复杂或者越困难。为防止模型因为文件头或者邮件地址而过拟合或泛化能力不强,体检的做法是从文档中去除文件头、文件尾和引用,因此需要确保考虑到了这一点。对于去除上述三项内容后的空文档或没用内容的文档,也将给予剔除,因为尝试从空文档中提取特征是毫无意义的。

开始下载所需的数据集以及为建立训练和测试数据集所用的函数:

from sklearn.datasets import fetch_20newsgroups
## 文档中使用的模块在高版本中会被剔除,根据提示替换模块解决问题
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
 
def get_data():
    data = fetch_20newsgroups(subset='all',
                              shuffle=True,
                              remove=('headers''footers''quotes'))
    return data
     
def prepare_datasets(corpus, labels, test_data_proportion=0.3):
    train_X, test_X, train_Y, test_Y = train_test_split(corpus, labels,
                                                        test_size=0.33, random_state=42)
    return train_X, test_X, train_Y, test_Y
 
def remove_empty_docs(corpus, labels):
    filtered_corpus = []
    filtered_labels = []
    for doc, label in zip(corpus, labels):
        if doc.strip():
            filtered_corpus.append(doc)
            filtered_labels.append(label)
    return filtered_corpus, filtered_labels

现在已经获得了数据,查看了数据集中分类的数量,使用下面的代码将数据集分为测试数据集和训练数据集。(下面代码执行下载数据集:)

In [20]: dataset = get_data()
    ...: print(dataset.target_names)
    ...:
Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)
['alt.atheism''comp.graphics''comp.os.ms-windows.misc''comp.sys.ibm.pc.hardware''comp.sys.mac.hardware''comp.windows.x''misc.forsale''rec.autos''rec.motorcycles''rec.sport.baseball''rec.sport.hockey''sci.crypt''sci.electronics''sci.med''sci.space''soc.religion.christian''talk.politics.guns''talk.politics.mideast''talk.politics.misc''talk.religion.misc']
In [21]: corpus, labels = dataset.data, dataset.target
    ...: corpus, labels = remove_empty_docs(corpus, labels)
    ...:
    ...: print('Sample document:', corpus[10])
    ...: print('Class label:',labels[10])
    ...: print('Actual class label:', dataset.target_names[labels[10]])
    ...:
    ...:
Sample document: the blood of the lamb.
 
This will be a hard task, because most cultures used most animals
for blood sacrifices. It has to be something related to our current
post-modernism state. Hmm, what about used computers?
 
Cheers,
Kent
Class label: 19
Actual class label: talk.religion.misc
train_corpus, test_corpus, train_labels, test_labels = prepare_datasets(corpus,
                                                                        labels,
                                                                        test_data_proportion=0.3)

从上面的代码可以看到文档和标签的情况。每个文档拥有自己的标签类,这些标签是需要进行分类的 20 个主题之一。这些标签是数字形式的,如果需要,可以使用上面的代码容易地将它们映射回原来的类别名字。已经把数据分为训练数据集和测试数据集,测试数据集占总数据的 30%。将使用训练数据集建立模型,使用测试数据集测试模型的性能。下面的代码将使用前面建立的规范化模块对数据集进行规范化处理:

normalization.py 折叠源码
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 26 20:45:10 2016
@author: DIP
"""
 
from contractions import CONTRACTION_MAP
import re
import nltk
import string
from nltk.stem import WordNetLemmatizer
 
stopword_list = nltk.corpus.stopwords.words('english')
wnl = WordNetLemmatizer()
 
def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    return tokens
 
def expand_contractions(text, contraction_mapping):
     
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())),
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                      
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
         
    expanded_text = contractions_pattern.sub(expand_match, text)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
     
     
from pattern.en import tag
from nltk.corpus import wordnet as wn
 
# Annotate text tokens with POS tags
def pos_tag_text(text):
     
    def penn_to_wn_tags(pos_tag):
        if pos_tag.startswith('J'):
            return wn.ADJ
        elif pos_tag.startswith('V'):
            return wn.VERB
        elif pos_tag.startswith('N'):
            return wn.NOUN
        elif pos_tag.startswith('R'):
            return wn.ADV
        else:
            return None
     
    tagged_text = tag(text)
    tagged_lower_text = [(word.lower(), penn_to_wn_tags(pos_tag))
                         for word, pos_tag in
                         tagged_text]
    return tagged_lower_text
     
# lemmatize text based on POS tags   
def lemmatize_text(text):
     
    pos_tagged_text = pos_tag_text(text)
    lemmatized_tokens = [wnl.lemmatize(word, pos_tag) if pos_tag
                         else word                    
                         for word, pos_tag in pos_tagged_text]
    lemmatized_text = ' '.join(lemmatized_tokens)
    return lemmatized_text
     
 
def remove_special_characters(text):
    tokens = tokenize_text(text)
    pattern = re.compile('[{}]'.format(re.escape(string.punctuation)))
    filtered_tokens = filter(None, [pattern.sub('', token) for token in tokens])
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
     
     
def remove_stopwords(text):
    tokens = tokenize_text(text)
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)   
    return filtered_text
 
     
 
def normalize_corpus(corpus, tokenize=False):
     
    normalized_corpus = []   
    for text in corpus:
        text = expand_contractions(text, CONTRACTION_MAP)
        text = lemmatize_text(text)
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        normalized_corpus.append(text)
        if tokenize:
            text = tokenize_text(text)
            normalized_corpus.append(text)
             
    return normalized_corpus
from normalization import normalize_corpus
 
norm_train_corpus = normalize_corpus(train_corpus)
norm_test_corpus = normalize_corpus(test_corpus)

执行语句可能会耗费一段时间才能完成。

如果出现类似错误:

...
RuntimeError: generator raised StopIteration

请切换至 Python3.6 或更高版本

记住,语料库中每个文档进行规范化处理需要很多步骤,所以这将会耗费一些时间才能完成。完成文档规范化处理后,将使用前面建立的特征提取模块从文档中提取特征。将分别建立词袋模型、TF-IDF 模型、平均词向量模型和 TF-IDF 加权平均词向量模型,并比较它们的性能。

下面的代码基于不同技术提取必要的特征:

from feature_extractors import bow_extractor, tfidf_extractor
from feature_extractors import averaged_word_vectorizer
from feature_extractors import tfidf_weighted_averaged_word_vectorizer
import nltk
import gensim
 
# bag of words features
bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus) 
bow_test_features = bow_vectorizer.transform(norm_test_corpus)
 
# tfidf features
tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus) 
tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)   
 
 
# tokenize documents
tokenized_train = [nltk.word_tokenize(text)
                   for text in norm_train_corpus]
tokenized_test = [nltk.word_tokenize(text)
                   for text in norm_test_corpus] 
# build word2vec model                  
model = gensim.models.Word2Vec(tokenized_train,
                               size=500,
                               window=100,
                               min_count=30,
                               sample=1e-3)                 
                    
# averaged word vector features
avg_wv_train_features = averaged_word_vectorizer(corpus=tokenized_train,
                                                 model=model,
                                                 num_features=500)                  
avg_wv_test_features = averaged_word_vectorizer(corpus=tokenized_test,
                                                model=model,
                                                num_features=500)                                               
# tfidf weighted averaged word vector features
vocab = tfidf_vectorizer.vocabulary_
tfidf_wv_train_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_train,
                                                                  tfidf_vectors=tfidf_train_features,
                                                                  tfidf_vocabulary=vocab,
                                                                  model=model,
                                                                  num_features=500)
tfidf_wv_test_features = tfidf_weighted_averaged_word_vectorizer(corpus=tokenized_test,
                                                                 tfidf_vectors=tfidf_test_features,
                                                                 tfidf_vocabulary=vocab,
                                                                 model=model,
                                                                 num_features=500)

使用上面的特征提取器从文本文档中提取了全部必要的特征之后,基于前面讨论的四个指标,定义一个函数用来苹果分类模型,函数如下面代码段所示:

from sklearn import metrics
import numpy as np
 
def get_metrics(true_labels, predicted_labels):
     
    print('Accuracy:', np.round(
                        metrics.accuracy_score(true_labels,
                                               predicted_labels),
                        2))
    print('Precision:', np.round(
                        metrics.precision_score(true_labels,
                                               predicted_labels,
                                               average='weighted'),
                        2))
    print('Recall:', np.round(
                        metrics.recall_score(true_labels,
                                               predicted_labels,
                                               average='weighted'),
                        2))
    print('F1 Score:', np.round(
                        metrics.f1_score(true_labels,
                                               predicted_labels,
                                               average='weighted'),
                        2))

现在定义一个函数使用机器学习算法和训练数据来训练模型,使用训练的模型在测试数据上执行预测,接着使用上面的函数苹果模型预测性能:

def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # build model   
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance  
    get_metrics(true_labels=test_labels,
                predicted_labels=predictions)
    return predictions

现在进入了 2 个机器学习算法,开始使用已经提取的特征建立模型。将使用前面提到的 scikit-learn 引入必要的分类算法,以节省花费在重写代码的时间和精力上:

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
 
mnb = MultinomialNB()
svm = SGDClassifier(loss='hinge', n_iter=100)

现在下面的代码将使用多项式朴素贝叶斯和支持向量机以及全部不同类型的特征进行模型训练、预测和评估:

# Multinomial Naive Bayes with bag of words features
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)
Accuracy: 0.67
Precision: 0.72
Recall: 0.67
F1 Score: 0.65
# Support Vector Machine with bag of words features
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)
Accuracy: 0.61
Precision: 0.67
Recall: 0.61
F1 Score: 0.62
# Multinomial Naive Bayes with tfidf features                                          
mnb_tfidf_predictions = train_predict_evaluate_model(classifier=mnb,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)
Accuracy: 0.72
Precision: 0.78
Recall: 0.72
F1 Score: 0.7
# Support Vector Machine with tfidf features
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)
Accuracy: 0.77
Precision: 0.77
Recall: 0.77
F1 Score: 0.77
# Support Vector Machine with averaged word vector features
svm_avgwv_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=avg_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=avg_wv_test_features,
                                           test_labels=test_labels)
Accuracy: 0.56
Precision: 0.58
Recall: 0.56
F1 Score: 0.56
# Support Vector Machine with tfidf weighted averaged word vector features
svm_tfidfwv_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=tfidf_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_wv_test_features,
                                           test_labels=test_labels)
Accuracy: 0.53
Precision: 0.58
Recall: 0.53
F1 Score: 0.52

使用不同类型的特征建立了 6 个模型,使用测试数据评估了模型的性能。从上面的结果可以看到使用 TF-IDF 特征的 SVM 模型获得了最好的结果,准确率、精确率、召回率和 F1 score 均为 77%。可以建立 SVM TF-IDF 模型的混淆矩阵,以便了解模型性能不好的具体分类的情况:

import pandas as pd
cm = metrics.confusion_matrix(test_labels, svm_tfidf_predictions)
pd.DataFrame(cm, index=range(0,20), columns=range(0,20))
Out[47]:
     0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16   17   18  19
0   156    3    0    1    1    0    2    3    4    1    4    4    2    4    5   34    3    7    7  22
1     1  224    9    7    8   14    8    0    2    1    0    2    5    4    4    1    4    0    3   0
2     1   20  221   18    9   18    8    1    0    0    0    3    5    2    2    2    1    1    2   0
3     1   11   25  223    9    4    9    2    1    1    1    2    6    3    1    0    1    0    0   0
4     0    4    7   15  228    6    5    2    3    1    0    3    9    3    3    1    1    0    1   0
5     0   21   18    1    2  272    0    1    1    0    0    0    4    3    1    0    0    1    0   0
6     0    2    7   11   12    1  270   10    3    2    1    1   10    1    4    0    2    1    1   0
7     1    5    2    2    2    3    4  246   19    1    3    2   10    3    2    0    4    3    3   1
8     3    1    0    4    2    2    5   27  252    3    4    2    1    4    1    3    2    2    4   0
9     1    1    1    0    2    3    5    3    6  278   12    2    1    1    2    4    2    0    1   0
10    0    0    0    0    0    0    1    3    2    4  282    1    2    1    4    1    0    1    1   0
11    3    5    3    3    1    2    2    2    2    3    0  259    6    2    0    1    5    2    5   0
12    1    6    6   15    7    2   13   10    8    4    4    2  212    3    5    1    1    1    0   1
13    2    4    0    1    3    4    3    0    2    0    1    1    7  267    4    2    3    0    4   0
14    0    5    3    0    2    4    2    5    4    1    2    0    8    3  264    2    4    1    3   1
15   11    1    0    0    1    1    0    0    4    1    3    2    1    7    5  292    4    4    2   4
16    4    1    0    0    0    4    2    1    7    2    2   11    3    2    4    2  227    3   13   3
17    6    0    1    0    1    3    0    2    3    2    4    6    1    3    1    6    5  259   10   2
18    9    1    2    1    0    1    2    1    5    3    3    7    0    9    6    4   33    7  165   3
19   21    5    0    1    0    2    3    3    7    2    1    1    0   11    3   57   21    7    3  65

从上表混淆矩阵上,可以看到很多类标签为 0 的文档被错误地分类到类标签 15 里面,同样对于类标签 18 的很多文档被错误地分类到类标签 16 里面。很多类标签 19 的文档被错误地分类到类型标签 15 里面。打印类型名字,可以看到如下输出:

In [48]: class_names = dataset.target_names
    ...: print(class_names[0], '->', class_names[15])
    ...: print(class_names[18], '->', class_names[16])
    ...: print(class_names[19], '->', class_names[15])
    ...:
    ...:
alt.atheism -> soc.religion.christian
talk.politics.misc -> talk.politics.guns
talk.religion.misc -> soc.religion.christian

从前面的输出可以看到错误分类与实际分类并没有显著的不同。Christian、religion 和 atheism 都是与商都和宗教存在有关的概念,可能会有相似的特征。杂项问题和强制都与政治有关,必然有相似的特征。可以使用下面的代码,进一步详细查看和分析被错误分类的问题:

import re
 
num = 0
for document, label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
    if label == 0 and predicted_label == 15:
        print('Actual Label:', class_names[label])
        print('Predicted Label:', class_names[predicted_label])
        print('Document:-')
        print(re.sub('\n'' ', document))
        print("")
        num += 1
        if num == 4:
            break

打印结果:

Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
I would like a list of Bible contadictions from those of you who dispite being free from Christianity are well versed in the Bible.
 
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
  They spent quite a bit of time on the wording of the Constitution.  They picked words whose meanings implied the intent.  We have already looked in the dictionary to define the word.  Isn't this sufficient?   But we were discussing it in relation to the death penalty.  And, the Constitution need not define each of the words within.  Anyone who doesn't know what cruel is can look in the dictionary (and we did).
 
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
Our Lord and Savior David Keresh has risen!     He has been seen alive!     Spread the word!     --------------------------------------------------------------------------------
 
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:-
  "This is your god" (from John Carpenter's "They Live," natch)
num = 0
for document, label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
    if label == 18 and predicted_label == 16:
        print('Actual Label:', class_names[label])
        print('Predicted Label:', class_names[predicted_label])
        print('Document:-')
        print(re.sub('\n'' ', document))
        print()
        num += 1
        if num == 4:
            break

打印结果:

Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:-
After the initial gun battle was over, they had 50 days to come out peacefully. They had their high priced lawyer, and judging by the posts here they had some public support. Can anyone come up with a rational explanation why the didn't come out (even after they negotiated coming out after the radio sermon) that doesn't include the Davidians wanting to commit suicide/murder/general mayhem?
 
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:-
Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire.  Today's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies.  At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today's paper quotes the government as saying, no, they didn't have a license.  Today's paper reports that a number of the bodies were found with shoulder weapons next to them, as if they had been using them while dying -- which doesn't sound like the sort of action I would expect from a suicide.  Our government lies, as it tries to cover over its incompetence and negligence.  Why should I believe the FBI's claims about anything else, when we can see that they are LYING?  This system of government is beyond reform.
 
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:-
  Well, for one thing most, if not all the Dividians (depending on whether they could show they acted in self-defense and there were no illegal weapons), could have gone on with their life as they were living it. No one was forcing them to give up their religion or even their legal weapons. The Dividians had survived a change in leadership before so even if Koresch himself would have been convicted and sent to jail, they still could have carried on.   I don't think the Dividians were insane, but I don't see a reason for mass suicide (if the fire was intentional set by some of the Dividians.) We also don't know that, if the fire was intentionally set from inside, was it a generally know plan or was this something only an inner circle knew about, or was it something two or three felt they had to do with or without Koresch's knowledge/blessing, etc.? I don't know much about Masada. Were some people throwing others over? Did mothers jump over with their babies in their arms?
 
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:-
rja@mahogany126.cray.com (Russ Anderson) writes...      The fact is that Koresh and his followers involved themselves   in a gun battle to control the Mt Carmel complex. That is not   in dispute. From what I remember of the trial, the authories    couldn't reasonably establish who fired first, the big reason   behind the aquittal.                    _____  _____                   \\\\\\/ ___/___________________   Mitchell S Todd  \\\\/ /                 _____/__________________________ ________________    \\/ / mst4298@zeus._____/.'.'.'.'.'.'.'.'.'.'.'.'_'_'_/ \_____        \__    / / tamu.edu  _____/.'.'.'.'.'.'.'.'.'.'.'.'.'_'_/     \__________\__  / /        _____/_'_'_'_'_'_'_'_'_'_'_'_'_'_'_/                 \_ / /__________/                  \/____/\\\\\\              \\\\\\

可以看到是如何分析和查看错误分类的文档的,然后回到前面步骤,调整优化特征提取方法,通过删除特征的单词或调整单词权重来减少或增加影响程度。

posted @ 2019-08-14 18:38  翡翠嫩白菜  阅读(820)  评论(0编辑  收藏  举报