Text Classification

These notes cover text classification as an extrinsic evaluation of word embeddings, that is, as a downstream task used to measure embedding quality.

Some concepts are drawn from the Fudan University NLP Group (复旦大学NLP组).

Statistical Methods

Logistic Regression

[Li Y 2015] describe text classification from a statistical perspective as follows:

We use Tencent news titles as our text classification dataset. A total of 8,826 titles of four categories (society, entertainment, healthcare, and military) are extracted. The lengths of titles range from 10 to 20 words. We train ℓ2-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al, 2008) with the learned embeddings.
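
For reference, the ℓ2-regularized logistic regression objective that LIBLINEAR solves in its primal form can be written as below for binary labels. The symbols are notation chosen here, not taken from [Li Y 2015]: x_i is assumed to be the embedding-based feature vector of title i, y_i ∈ {−1, +1} its label, n the number of training titles, and C the penalty parameter. For a four-category task such as the one quoted above, LIBLINEAR trains one such binary classifier per category (one-vs-rest).

```latex
\min_{w} \; \frac{1}{2} w^{\top} w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^{\top} x_i}\right)
```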

[Kiros 2015] describe a similar setup:

On all datasets, we simply extract skip-thought vectors and train a logistic regression classifier on top.

[Yan Song 2018] also apply this kind of method:

This document classification experiment is performed in a conventional way as that in previous studies [Kiela et al., 2015; Kiros et al., 2015]. For all the documents in training and test datasets, we first construct document level representations by averaging the embeddings from all words in a given document. A logistic regression classifier is then trained on top of the resulted document level representations on the training set and evaluated on the test set.
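
The averaging-then-logistic-regression pipeline quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. The 2-dimensional embeddings, the toy documents, and the helper names (average_embedding, train_logistic_regression, predict) are all made up for the example. The task is reduced to two classes, and plain batch gradient descent stands in for a library solver such as LIBLINEAR.

```python
import math


def average_embedding(tokens, embeddings, dim):
    """Document vector = mean of the vectors of in-vocabulary tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to the zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, l2=0.01, lr=0.5, epochs=500):
    """Batch gradient descent on the l2-regularized log loss; labels are 0/1."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [l2 * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return class 1 when the linear score is non-negative, else class 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


# Hypothetical toy embeddings: first axis ~ "sports", second axis ~ "health".
embeddings = {
    "goal": [1.0, 0.1], "match": [0.9, 0.0], "team": [0.8, 0.2],
    "vaccine": [0.1, 1.0], "doctor": [0.0, 0.9], "clinic": [0.2, 0.8],
}
train_docs = [
    (["team", "goal"], 0), (["match", "team"], 0), (["goal", "match"], 0),
    (["doctor", "clinic"], 1), (["vaccine", "doctor"], 1), (["clinic", "vaccine"], 1),
]
test_docs = [(["goal", "team", "match"], 0), (["doctor", "vaccine", "clinic"], 1)]

# Build document-level representations, train on the training set, evaluate on the test set.
X_train = [average_embedding(doc, embeddings, 2) for doc, _ in train_docs]
y_train = [label for _, label in train_docs]
w, b = train_logistic_regression(X_train, y_train)

predictions = [predict(w, b, average_embedding(doc, embeddings, 2)) for doc, _ in test_docs]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test_docs)) / len(test_docs)
print(accuracy)
```

In a real evaluation the embeddings would be the learned ones under test, and the reported test accuracy serves as the extrinsic score for comparing embedding models.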

Linear SVM

[Kiela 2015] describe it as follows:

we first construct document-level representations by summing the vector representations for all words in a given document. After setting aside a small development set for tuning the hyperparameters of the supervised algorithm, we train a support vector machine (SVM) classifier with a linear kernel and evaluate document topic classification accuracy using ten-fold cross-validation.
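
The summing and cross-validation steps quoted above can be sketched roughly as follows. Again this is only an illustration under stated assumptions. The toy embeddings, the synthetic documents, and the helper names (sum_embedding, train_linear_svm, cross_validate) are hypothetical. A simple batch subgradient descent on the regularized hinge loss stands in for a production SVM solver. The development set used for hyperparameter tuning is omitted, so lam, lr, and epochs are fixed by hand.

```python
def sum_embedding(tokens, embeddings, dim):
    """Document vector = sum of the vectors of in-vocabulary tokens."""
    total = [0.0] * dim
    for token in tokens:
        vector = embeddings.get(token)
        if vector is not None:
            for i in range(dim):
                total[i] += vector[i]
    return total


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the l2-regularized hinge loss; labels are -1/+1."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wj for wj in w], 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1.0:  # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def cross_validate(X, y, k=10):
    """k-fold cross-validation; fold f holds out every index i with i % k == f."""
    accuracies = []
    for fold in range(k):
        test_idx = [i for i in range(len(X)) if i % k == fold]
        train_idx = [i for i in range(len(X)) if i % k != fold]
        w, b = train_linear_svm([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum((1 if dot(w, X[i]) + b >= 0 else -1) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Hypothetical toy embeddings and 20 synthetic three-word documents in two topics.
embeddings = {
    "goal": [1.0, 0.1], "match": [0.9, 0.0], "team": [0.8, 0.2],
    "vaccine": [0.1, 1.0], "doctor": [0.0, 0.9], "clinic": [0.2, 0.8],
}
sports, health = ["goal", "match", "team"], ["vaccine", "doctor", "clinic"]
docs, labels = [], []
for i in range(20):
    vocab, label = (sports, -1) if i % 2 == 0 else (health, 1)
    docs.append([vocab[i % 3], vocab[(i + 1) % 3], vocab[(i // 2) % 3]])
    labels.append(label)

X = [sum_embedding(doc, embeddings, 2) for doc in docs]
fold_accuracies = cross_validate(X, labels, k=10)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(mean_accuracy)
```

Summing rather than averaging makes the document vector scale with document length, which a linear classifier can be sensitive to. The mean accuracy over the ten folds is the figure that would be reported.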

Bibliography

Fudan University NLP Group (复旦大学NLP组). NLP-Beginner. https://github.com/FudanNLP/nlp-beginner

[Li Y 2015] Li Y, Li W, Sun F, et al. "Component-Enhanced Chinese Character Embeddings." Empirical Methods in Natural Language Processing (EMNLP), 2015: 829-834.

[Kiros 2015] Kiros, Ryan, et al. "Skip-Thought Vectors." Advances in Neural Information Processing Systems 28(2015).

[Yan Song 2018] Song, Yan, et al. "Joint Learning Embeddings for Chinese Words and their Components via Ladder Structured Networks." IJCAI (2018).

[Kiela 2015] Kiela, Douwe, et al. "Specializing Word Embeddings for Similarity or Relatedness." EMNLP (2015).
