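Text Classification

What is text classification

Text classification is one of the most common families of NLP tasks. The input is usually a piece of text, and the output is a predicted class label. Typical tasks include topic classification, sentiment analysis, authorship attribution, and authenticity (fake) detection; many other problems can also be recast as classification problems.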

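Typical steps

  1. Choose a task you are interested in
  2. Collect a suitable dataset
  3. Annotate the data
  4. Select features
  5. Choose a machine learning method
  6. Tune hyperparameters on the validation set
  7. Try several algorithms and parameter settings
  8. Train the final model
  9. Evaluate on the test set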
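
As a rough illustration of the data side of these steps, the sketch below shuffles a labelled dataset and splits it into training, validation, and test portions. The function name `split_dataset`, the 80/10/10 proportions, and the toy data are assumptions made for illustration only.

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle labelled examples and split them into train / validation / test."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Hypothetical toy dataset of (text, label) pairs.
data = [(f"doc {i}", i % 2) for i in range(100)]
train, val, test = split_dataset(data)
```

The validation portion is used for tuning, and the test portion is touched only once, for the final evaluation.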

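Machine learning algorithms

Here is a brief introduction to a few basic machine learning algorithms.

1. Naive Bayes

Assume the features are mutually independent given the class, then use Bayes' rule to find the most likely class: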

\[P(c_n|f_1...f_m) \propto p(c_n)\prod_{i=1}^m p(f_i|c_n) \]

Pros: Fast to “train” and classify; robust, low-variance; good for low-data situations; optimal classifier if the independence assumption is correct; extremely simple to implement.

Cons: Independence assumption rarely holds; low accuracy compared to similar methods in most situations; smoothing required for unseen class/feature combinations.
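
A minimal sketch of a Naive Bayes text classifier, assuming bag-of-words tokens as features and add-one (Laplace) smoothing; the function names and the toy documents are hypothetical. Scores are summed in log space to avoid underflow when multiplying many small probabilities.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns class counts, per-class token counts, vocabulary."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def predict_nb(model, tokens):
    """Return the class maximising log p(c) + sum_i log p(f_i | c)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_tokens = sum(token_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # add-one smoothing handles unseen class/feature combinations
            score += math.log((token_counts[label][tok] + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
toy_docs = [
    ("great fun film".split(), "pos"),
    ("great acting".split(), "pos"),
    ("boring slow film".split(), "neg"),
    ("slow plot".split(), "neg"),
]
model = train_nb(toy_docs)
```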

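2. Logistic Regression

Logistic regression is linear regression with a small modification: a link function transforms the linear score (somewhat like "straightening a curve") so that the model outputs a probability between 0 and 1.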

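\[P(c_n|f_1...f_m) = \frac{1}{Z} \exp\left(\sum_{i=0}^m w_i f_i\right) \]

Training works much like training a regression model: minimise a cost function to find the weights, optionally adding a regularisation term as a penalty.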

Pros: Unlike Naïve Bayes, not confounded by diverse, correlated features.

Cons: High bias; slow to train; some feature scaling issues; often needs a lot of data to work well; choosing the regularisation is a nuisance but important, since overfitting is a big problem.
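
The scoring formula can be sketched as follows, assuming one weight vector per class and a constant feature f_0 = 1 acting as the bias term (both are assumptions; the formula above does not spell them out). The weights here are made up, not learned.

```python
import math

def class_probs(weights, features):
    """weights: {label: [w_0..w_m]}; features: [f_0..f_m] with f_0 = 1 as the bias input."""
    scores = {c: sum(w * f for w, f in zip(ws, features)) for c, ws in weights.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - max_score) for c, s in scores.items()}
    z = sum(exps.values())  # the normaliser Z
    return {c: e / z for c, e in exps.items()}

# Hypothetical weights for a two-class problem with two real features.
weights = {"pos": [0.0, 1.5, -1.0], "neg": [0.0, -1.5, 1.0]}
probs = class_probs(weights, [1.0, 2.0, 0.0])
```

With two classes this reduces to the familiar sigmoid of the score difference.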

3. Support Vector Machines (SVM)

Main idea: find a hyperplane that separates the training data, then use it to classify test data. Details are omitted here.

Pros: Fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.

Cons: Multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalances; uninterpretable.
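
Only the prediction side is sketched here: given a hyperplane (w, b), a linear SVM labels a point by which side of the hyperplane it falls on. The weights below are hypothetical and are not learned by this snippet; finding the maximum-margin hyperplane is the part left out.

```python
def svm_decision(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if svm_decision(w, b, x) >= 0 else -1

# Hypothetical hyperplane x_1 - x_2 = 0 (assumed, not trained).
w, b = [1.0, -1.0], 0.0
```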

4. K-Nearest Neighbour (KNN)

Main idea: compute the distance (for example Euclidean or cosine distance) between the new observation and the labelled data, and assign the most common label among its k nearest neighbours.

Pros: Simple, effective; no training required; inherently multiclass; optimal with infinite data.

Cons: Have to select k; issues with unbalanced classes; often slow (need to find those k neighbours); features must be selected carefully.
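
A minimal KNN sketch using cosine distance and a majority vote, assuming documents are already represented as non-zero count vectors. The function names and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_predict(train, query, k=3, distance=cosine_distance):
    """train: list of (vector, label). Returns the majority label of the k nearest items."""
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical term-count vectors.
train = [
    ([3, 0, 1], "sports"),
    ([2, 1, 0], "sports"),
    ([0, 3, 1], "politics"),
    ([1, 2, 0], "politics"),
    ([0, 2, 2], "politics"),
]
```

Sorting the whole training set for every query is what makes the naive version slow; real implementations usually rely on an index structure.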

5. Decision Tree

Main idea: build a tree from the feature information; each leaf node at the bottom is a class.

Pros: In theory, very interpretable; fast to build and test; feature representation/scaling irrelevant; good for small feature sets; handles non-linearly-separable problems.

Cons: In practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
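
To show what "leaves are classes" looks like in practice, here is a hand-written tree stored as nested dictionaries, plus the traversal that classifies an example. The tree, its feature names, and the labels are all invented for illustration; the snippet does not learn a tree from data.

```python
# Internal nodes test one boolean feature; leaves are plain class labels.
tree = {
    "feature": "contains_goal",
    "yes": "sports",
    "no": {"feature": "contains_vote", "yes": "politics", "no": "other"},
}

def tree_predict(node, features):
    """Walk from the root to a leaf, following the branch chosen by each feature test."""
    while isinstance(node, dict):
        branch = "yes" if features.get(node["feature"], False) else "no"
        node = node[branch]
    return node
```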

6. Random Forest

Main idea: an ensemble of many decision trees, with the final label chosen by majority vote.

Pros: Usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelised.

Cons: Same negatives as decision trees: too slow with large feature sets.
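
The voting step can be sketched as below. Each "tree" is stood in for by a simple hand-written function, which is an assumption for brevity; a real random forest trains each tree on a bootstrap sample with a random subset of features.

```python
from collections import Counter

def forest_predict(trees, features):
    """Collect one vote per tree and return the most common label."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees (feature names are made up).
trees = [
    lambda f: "spam" if f["num_links"] > 2 else "ham",
    lambda f: "spam" if f["has_offer"] else "ham",
    lambda f: "spam" if f["num_caps"] > 10 else "ham",
]
```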

7. Neural Network


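Main idea: layers of nodes are connected to one another, and each node passes a weighted sum of the previous layer's outputs, usually through a non-linear activation, on to the next layer. Details are omitted here; at its core, each node is still a linear model.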

Pros: Extremely powerful, state-of-the-art accuracy on many tasks in natural language processing and vision.

Cons: Not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to overfitting.
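
A forward pass through a tiny feed-forward network, written out by hand to show that each layer is a weighted sum followed by an activation. The layer sizes, the ReLU and softmax choices, and the parameter values are assumptions for illustration; no training is performed.

```python
import math

def dense(inputs, weights, biases):
    """One layer of weighted sums: each row of weights feeds one output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, params):
    """Input -> hidden layer (ReLU) -> output layer (softmax over classes)."""
    hidden = relu(dense(x, params["W1"], params["b1"]))
    return softmax(dense(hidden, params["W2"], params["b2"]))

# Hypothetical parameters for a 2-2-2 network.
params = {
    "W1": [[1.0, -1.0], [-1.0, 1.0]],
    "b1": [0.0, 0.0],
    "W2": [[2.0, 0.0], [0.0, 2.0]],
    "b2": [0.0, 0.0],
}
probs = forward([1.0, 0.0], params)
```

Without the non-linear activation between layers, the stacked layers would collapse into a single linear model.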

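Hyperparameter tuning

After training a model on the training set, we can tune its hyperparameters on the validation set. Common approaches include k-fold cross-validation and grid search.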
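
The two techniques are often combined: for each candidate value in a grid, average the validation score over k folds and keep the best. The sketch below assumes the caller supplies `fit` and `accuracy` functions; those names, the fold layout, and the toy threshold "model" are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; fold i holds indices i, i+k, i+2k, ..."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def grid_search(data, candidates, fit, accuracy, k=5):
    """Return (best candidate, its mean validation accuracy over k folds)."""
    best, best_score = None, -1.0
    for value in candidates:
        scores = []
        for train_idx, val_idx in k_fold_indices(len(data), k):
            model = fit([data[j] for j in train_idx], value)
            scores.append(accuracy(model, [data[j] for j in val_idx]))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best, best_score = value, mean
    return best, best_score

# Toy setup: the "hyperparameter" is a decision threshold on a single number.
def fit(train, threshold):
    return lambda x: int(x >= threshold)  # ignores train; a stand-in for real training

def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

data = [(x, int(x >= 5)) for x in range(10)]
best, score = grid_search(data, [2, 5, 8], fit, accuracy, k=5)
```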

Evaluation

Common evaluation metrics:

  1. Accuracy = number correct / total

  2. Precision = tp / (tp + fp)

  3. Recall = tp / (tp + fn)

  4. F1-score = 2 * precision * recall / (precision + recall)
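
These four metrics can be computed directly from gold and predicted labels. The sketch below assumes a binary task with label 1 as the positive class and returns 0.0 when a denominator is zero, which is a convention chosen here rather than something stated above.

```python
def binary_metrics(gold, pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: tp = 2, fp = 1, fn = 2.
gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 0, 1, 0, 0]
metrics = binary_metrics(gold, pred)
```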

There are also the macro F-score and the micro F-score for tasks with more than two classes.
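
A short sketch of the difference, assuming single-label multiclass data: the macro F-score averages the per-class F1 values, while the micro F-score pools tp, fp and fn over all classes before computing one F1. Function names and the toy labels are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written as 2tp / (2tp + fp + fn); 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(gold, pred):
    labels = sorted(set(gold) | set(pred))
    counts = {}
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        counts[c] = (tp, fp, fn)
    macro = sum(f1_from_counts(*counts[c]) for c in labels) / len(labels)
    micro = f1_from_counts(
        sum(v[0] for v in counts.values()),
        sum(v[1] for v in counts.values()),
        sum(v[2] for v in counts.values()),
    )
    return macro, micro

# Hypothetical imbalanced example: class "c" is rare and always missed.
gold = ["a", "a", "a", "a", "b", "c"]
pred = ["a", "a", "a", "b", "b", "a"]
macro, micro = macro_micro_f1(gold, pred)
```

The macro score is pulled down by the missed rare class, whereas the micro score is dominated by the frequent class.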

posted @ 2020-06-19 19:58  MrDoghead