中文短文本分类

文本分类，属于有监督学习中的一部分，在很多场景下都有应用，下面通过小数据的实例，一步步完成中文短文本的分类实现，整个过程尽量做到少理论重实战。

下面使用的数据是一份司法数据，需求是对每一条输入数据，判断事情的主体是谁，比如报警人被老公打，报警人被老婆打，报警人被儿子打，报警人被女儿打等来进行文本有监督的分类操作。

整个过程分为以下几个步骤：

语料加载
分词
去停用词
抽取词向量特征
分别进行算法建模和模型训练
评估、计算 AUC 值
模型对比

基本流程如下图所示：

enter image description here

下面开始项目实战。

1. 首先进行语料加载，在这之前，引入所需要的 Python 依赖包，并将全部语料和停用词字典读入内存中。

第一步，引入依赖库，有随机数库、jieba 分词、pandas 库等：

import random
import jieba
import pandas as pd

第二步，加载停用词字典，停用词词典为 stopwords.txt 文件，可以根据场景自己在该文本里面添加要去除的词（比如冠词、人称、数字等特定词）：

#加载停用词
stopwords=pd.read_csv('stopwords.txt',index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords=stopwords['stopword'].values

第三步，加载语料，语料是4个已经分好类的 csv 文件，直接用 pandas 加载即可，加载之后可以首先删除 nan 行，并提取要分词的 content 列转换为 list 列表：

# 加载语料
laogong_df = pd.read_csv('beilaogongda.csv', encoding='utf-8', sep=',')
laopo_df = pd.read_csv('beilaogongda.csv', encoding='utf-8', sep=',')
erzi_df = pd.read_csv('beierzida.csv', encoding='utf-8', sep=',')
nver_df = pd.read_csv('beinverda.csv', encoding='utf-8', sep=',')
# 删除语料的nan行
laogong_df.dropna(inplace=True)
laopo_df.dropna(inplace=True)
erzi_df.dropna(inplace=True)
nver_df.dropna(inplace=True)
# 转换
laogong = laogong_df.segment.values.tolist()
laopo = laopo_df.segment.values.tolist()
erzi = erzi_df.segment.values.tolist()
nver = nver_df.segment.values.tolist()

2. 分词和去停用词。

第一步，定义分词、去停用词和批量打标签的函数，函数包含3个参数：content_lines 参数为语料列表；sentences 参数为预先定义的 list，用来存储分词并打标签后的结果；category 参数为标签：

# 定义分词和打标签函数preprocess_text
# 参数content_lines即为上面转换的list
# 参数sentences是定义的空list，用来储存打标签之后的数据
# 参数category 是类型标签
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # 去数字
            segs = list(filter(lambda x: x.strip(), segs))  # 去左右空格
            segs = list(filter(lambda x: len(x) > 1, segs))  # 长度为1的字符
            segs = list(filter(lambda x: x not in stopwords, segs))  # 去掉停用词
            sentences.append((" ".join(segs), category))  # 打标签
        except Exception:
            print(line)
            continue

第二步，调用函数、生成训练数据，根据我提供的司法语料数据，分为报警人被老公打，报警人被老婆打，报警人被儿子打，报警人被女儿打，标签分别为0、1、2、3，具体如下：

sentences = []
preprocess_text(laogong, sentences,0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)

第三步，将得到的数据集打散，生成更可靠的训练集分布，避免同类数据分布不均匀：

random.shuffle(sentences)

第四步，我们在控制台输出前10条数据，观察一下：

for sentence in sentences[:10]:
　　print(sentence[0], sentence[1])  #下标0是词列表，1是标签

得到的结果如图所示：

enter image description here

3. 抽取词向量特征。

第一步，抽取特征，我们定义文本抽取词袋模型特征：

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
　　analyzer='word', # tokenise by character ngrams
　　max_features=4000,  # keep the most common 1000 ngrams
    )

第二步，把语料数据切分，用 sk-learn 对数据切分，分成训练集和测试集：

from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1256)

第三步，把训练数据转换为词袋模型：

vec.fit(x_train)

4. 分别进行算法建模和模型训练。

定义朴素贝叶斯模型，然后对训练集进行模型训练，直接使用 sklearn 中的 MultinomialNB：

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)

5. 评估、计算 AUC 值。

第一步，上面步骤1-4完成了从语料到模型的训练，训练之后，我们要用测试集来计算 AUC 值：

print(classifier.score(vec.transform(x_test), y_test))

得到的结果评分为：0.647331786543。

第二步，进行测试集的预测：

 pre = classifier.predict(vec.transform(x_test))

6. 模型对比。

整个模型从语料到训练评估步骤1-5就完成了，接下来我们来看看，改变特征向量模型和训练模型对结果有什么变化。

（1）改变特征向量模型

下面可以把特征做得更强一点，尝试加入抽取 2-gram 和 3-gram 的统计特征，把词库的量放大一点。

   from sklearn.feature_extraction.text import CountVectorizer
    vec = CountVectorizer(
        analyzer='word', # tokenise by character ngrams
        ngram_range=(1,4),  # use ngrams of size 1 and 2
        max_features=20000,  # keep the most common 1000 ngrams
    )
    vec.fit(x_train)
    #用朴素贝叶斯算法进行模型训练
    classifier = MultinomialNB()
    classifier.fit(vec.transform(x_train), y_train)
    #对结果进行评分
    print(classifier.score(vec.transform(x_test), y_test))

得到的结果评分为：0.649651972158，确实有一点提高，但是不太明显。

（2）改变训练模型

使用 SVM 训练：

from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(vec.transform(x_train), y_train)
print(svm.score(vec.transform(x_test), y_test))

使用决策树、随机森林、XGBoost、神经网络等等：

import xgboost as xgb  
from sklearn.model_selection import StratifiedKFold  
import numpy as np
# xgb矩阵赋值  
xgb_train = xgb.DMatrix(vec.transform(x_train), label=y_train)  
xgb_test = xgb.DMatrix(vec.transform(x_test))

在 XGBoost 中，下面主要是调参指标，可以根据参数进行调参：

    params = {  
            'booster': 'gbtree',     #使用gbtree
            'objective': 'multi:softmax',  # 多分类的问题、  
            # 'objective': 'multi:softprob',   # 多分类概率  
            #'objective': 'binary:logistic',  #二分类
            'eval_metric': 'merror',   #logloss
            'num_class': 4,  # 类别数，与 multisoftmax 并用  
            'gamma': 0.1,  # 用于控制是否后剪枝的参数,越大越保守，一般0.1、0.2这样子。  
            'max_depth': 8,  # 构建树的深度，越大越容易过拟合  
            'alpha': 0,   # L1正则化系数  
            'lambda': 10,  # 控制模型复杂度的权重值的L2正则化项参数，参数越大，模型越不容易过拟合。  
            'subsample': 0.7,  # 随机采样训练样本  
            'colsample_bytree': 0.5,  # 生成树时进行的列采样  
            'min_child_weight': 3,  
            # 这个参数默认是 1，是每个叶子里面 h 的和至少是多少，对正负样本不均衡时的 0-1 分类而言  
            # 假设 h 在 0.01 附近，min_child_weight 为 1 叶子节点中最少需要包含 100 个样本。  
            'silent': 0,  # 设置成1则没有运行信息输出，最好是设置为0.  
            'eta': 0.03,  # 如同学习率  
            'seed': 1000,  
            'nthread': -1,  # cpu 线程数  
            'missing': 1 
        }

总结

上面通过真实司法数据，一步步实现中文短文本分类的方法，整个示例代码可以当做模板来用，从优化和提高模型准确率来说，主要有两方面可以尝试：

特征向量的构建，除了词袋模型，可以考虑使用 word2vec 和 doc2vec 等；
模型上可以选择有监督的分类算法、集成学习以及神经网络等。

  1 import random
  2 import jieba
  3 import pandas as pd
  4 
  5 #加载停用词
  6 stopwords=pd.read_csv('./data6/stopwords.txt',index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
  7 stopwords=stopwords['stopword'].values
  8 
  9 # 加载语料
 10 laogong_df = pd.read_csv('./data6/beilaogongda.csv', encoding='utf-8', sep=',')
 11 laopo_df = pd.read_csv('./data6/beilaogongda.csv', encoding='utf-8', sep=',')
 12 erzi_df = pd.read_csv('./data6/beierzida.csv', encoding='utf-8', sep=',')
 13 nver_df = pd.read_csv('./data6/beinverda.csv', encoding='utf-8', sep=',')
 14 # 删除语料的nan行
 15 laogong_df.dropna(inplace=True)
 16 laopo_df.dropna(inplace=True)
 17 erzi_df.dropna(inplace=True)
 18 nver_df.dropna(inplace=True)
 19 # 转换
 20 laogong = laogong_df.segment.values.tolist()
 21 laopo = laopo_df.segment.values.tolist()
 22 erzi = erzi_df.segment.values.tolist()
 23 nver = nver_df.segment.values.tolist()
 24 
 25 # 定义分词和打标签函数preprocess_text
 26 # 参数content_lines即为上面转换的list
 27 # 参数sentences是定义的空list，用来储存打标签之后的数据
 28 # 参数category 是类型标签
 29 def preprocess_text(content_lines, sentences, category):
 30     for line in content_lines:
 31         try:
 32             segs = jieba.lcut(line)
 33             segs = [v for v in segs if not str(v).isdigit()]  # 去数字
 34             segs = list(filter(lambda x: x.strip(), segs))  # 去左右空格
 35             segs = list(filter(lambda x: len(x) > 1, segs))  # 长度为1的字符
 36             segs = list(filter(lambda x: x not in stopwords, segs))  # 去掉停用词
 37             sentences.append((" ".join(segs), category))  # 打标签
 38         except Exception:
 39             print(line)
 40             continue
 41 
 42 sentences = []
 43 preprocess_text(laogong, sentences,0)
 44 preprocess_text(laopo, sentences, 1)
 45 preprocess_text(erzi, sentences, 2)
 46 preprocess_text(nver, sentences, 3)
 47 
 48 random.shuffle(sentences)
 49 
 50 for sentence in sentences[:10]:
 51     print(sentence[0], sentence[1])  #下标0是词列表，1是标签
 52 
 53 from sklearn.feature_extraction.text import CountVectorizer
 54 vec = CountVectorizer(
 55     analyzer='word', # tokenise by character ngrams
 56     max_features=4000,  # keep the most common 1000 ngrams
 57     )
 58 
 59 from sklearn.model_selection import train_test_split
 60 x, y = zip(*sentences)
 61 x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1256)
 62 
 63 vec.fit(x_train)
 64 
 65 from sklearn.naive_bayes import MultinomialNB
 66 classifier = MultinomialNB()
 67 classifier.fit(vec.transform(x_train), y_train)
 68 print(classifier.score(vec.transform(x_test), y_test))
 69 
 70 pre = classifier.predict(vec.transform(x_test))
 71 
 72 from sklearn.feature_extraction.text import CountVectorizer
 73 vec = CountVectorizer(
 74         analyzer='word', # tokenise by character ngrams
 75         ngram_range=(1,4),  # use ngrams of size 1 and 2
 76         max_features=20000,  # keep the most common 1000 ngrams
 77     )
 78 vec.fit(x_train)
 79 #用朴素贝叶斯算法进行模型训练
 80 classifier = MultinomialNB()
 81 classifier.fit(vec.transform(x_train), y_train)
 82 #对结果进行评分
 83 print(classifier.score(vec.transform(x_test), y_test))
 84 
 85 print("------------------------------")
 86 from sklearn.svm import SVC
 87 svm = SVC(kernel='linear')
 88 svm.fit(vec.transform(x_train), y_train)
 89 print(svm.score(vec.transform(x_test), y_test))
 90 
 91 
 92 print("------------------------------")
 93 import xgboost as xgb
 94 from sklearn.model_selection import StratifiedKFold
 95 import numpy as np
 96 # xgb矩阵赋值
 97 xgb_train = xgb.DMatrix(vec.transform(x_train), label=y_train)
 98 xgb_test = xgb.DMatrix(vec.transform(x_test))
 99 
100 params = {
101             'booster': 'gbtree',     #使用gbtree
102             'objective': 'multi:softmax',  # 多分类的问题、
103             # 'objective': 'multi:softprob',   # 多分类概率
104             #'objective': 'binary:logistic',  #二分类
105             'eval_metric': 'merror',   #logloss
106             'num_class': 4,  # 类别数，与 multisoftmax 并用
107             'gamma': 0.1,  # 用于控制是否后剪枝的参数,越大越保守，一般0.1、0.2这样子。
108             'max_depth': 8,  # 构建树的深度，越大越容易过拟合
109             'alpha': 0,   # L1正则化系数
110             'lambda': 10,  # 控制模型复杂度的权重值的L2正则化项参数，参数越大，模型越不容易过拟合。
111             'subsample': 0.7,  # 随机采样训练样本
112             'colsample_bytree': 0.5,  # 生成树时进行的列采样
113             'min_child_weight': 3,
114             # 这个参数默认是 1，是每个叶子里面 h 的和至少是多少，对正负样本不均衡时的 0-1 分类而言
115             # 假设 h 在 0.01 附近，min_child_weight 为 1 叶子节点中最少需要包含 100 个样本。
116             'silent': 0,  # 设置成1则没有运行信息输出，最好是设置为0.
117             'eta': 0.03,  # 如同学习率
118             'seed': 1000,
119             'nthread': -1,  # cpu 线程数
120             'missing': 1
121         }

posted on 2019-12-03 15:47 农夫三拳有點疼阅读(2289) 评论(1) 编辑收藏举报

刷新页面返回顶部

农夫三拳有點疼

中文短文本分类

导航