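Text Classification Datasets
This article lists datasets for text classification. Most of them can be found and downloaded from the text classification task datasets on Hugging Face.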
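Overview
Text classification:
IMDB: movie reviews
AGNews: news topic classification dataset
CoLA: Corpus of Linguistic Acceptability, for judging whether a sentence is grammatical
SST2: Stanford Sentiment Treebank, mostly movie reviews
rotten-tomatoes: Rotten Tomatoes movie reviews
Yelp Review: business review dataset
Yahoo! Answers Topic Classification Dataset: topic classification of Yahoo! Answers questions
Amazon polarity: Amazon product review dataset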
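Natural language inference (sentence-pair classification):
SNLI: Stanford Natural Language Inference.
MNLI: Multi-Genre Natural Language Inference.
MRPC: Microsoft Research Paraphrase Corpus, for judging whether two sentences are paraphrases.
QNLI: a natural language inference dataset derived from Stanford's SQuAD
QQP: Quora Question Pairs
RTE: Recognizing Textual Entailment
STS-B: similarity of two sentences
WNLI: Winograd natural language inference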
News categorization
Yahoo! Answers Topic Classification Dataset
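Download URL
A simple way to get this dataset is to load it with Hugging Face datasets. The download URL can be found in its source code: https://s3.amazonaws.com/fast-ai-nlp/yahoo_answers_csv.tgz
Dataset description
The Yahoo! Answers topic classification dataset has 1,400,000 training samples and 60,000 test samples. Each sample has 4 fields, for example: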
topic_id: "5",
question_title: "why doesn't an optical mouse work on a glass table?",
question_content: "or even on some surfaces?",
best_answer: "Optical mice use an LED and a camera to rapidly capture images of the surface
beneath the mouse. The infomation from the camera is analyzed by a DSP (Digital
Signal Processor) and used to detect imperfections in the underlying surface and
determine motion. Some materials, such as glass, mirrors or other very shiny,
uniform surfaces interfere with the ability of the DSP to accurately analyze
the surface beneath the mouse. \nSince glass is transparent and very uniform,
the mouse is unable to pick up enough imperfections in the underlying surface
to determine motion. Mirrored surfaces are also a problem, since they constantly
reflect back the same image, causing the DSP not to recognize motion properly.
When the system is unable to see surface changes associated with movement, the
mouse will not work properly."
There are 10 labels in total:
Society & Culture
Science & Mathematics
Health
Education & Reference
Computers & Internet
Sports
Business & Finance
Entertainment & Music
Family & Relationships
Politics & Government
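The record layout and label list above can be tied together with a short parsing sketch. It assumes that the extracted CSV has no header row, that its four columns follow the field order shown above, and that topic ids are 1-based in the order the labels are listed (which matches the sample, where topic 5 is a computer question). The names `YAHOO_TOPICS`, `YAHOO_FIELDS`, and `parse_yahoo_rows` are made up for this illustration.

```python
import csv
import io

# Label names in the order listed above; ids are ASSUMED to be 1-based.
YAHOO_TOPICS = [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government",
]
# Field names as used in the sample record above.
YAHOO_FIELDS = ("topic_id", "question_title", "question_content", "best_answer")


def parse_yahoo_rows(csv_text):
    """Parse header-less CSV text into dicts and attach the topic name."""
    records = []
    for row in csv.reader(io.StringIO(csv_text)):
        record = dict(zip(YAHOO_FIELDS, row))
        record["topic_name"] = YAHOO_TOPICS[int(record["topic_id"]) - 1]
        records.append(record)
    return records


# Shortened version of the sample record shown above.
sample_csv = (
    '"5","why doesn\'t an optical mouse work on a glass table?",'
    '"or even on some surfaces?","Optical mice use an LED and a camera."\n'
)
yahoo_records = parse_yahoo_rows(sample_csv)
print(yahoo_records[0]["topic_name"])
```

If the copy you load uses 0-based ids instead, drop the `- 1` in the lookup.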
AGNews
Download URL
Training set: https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
Test set: https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
Dataset description
The AGNews dataset was collected by the academic news search engine ComeToMyHead from more than 2,000 news sources. It has 120,000 training samples and 7,600 test samples. Each sample has 3 fields, for example:
id: "3",
headline: "Wall St. Bears Claw Back Into the Black (Reuters)",
content: "Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics,
are seeing green again."
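There are 4 labels in total. The CSV files contain only the numbers 1 to 4, stored in the first field of each record (shown as id above). The classes.txt file in the original ag_news_csv release maps them to World, Sports, Business, and Sci/Tech.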
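Because the labels are plain numbers, a quick first check after downloading is to count how often each value appears. The sketch below does this for CSV text in the three-column layout described above. The function name `count_labels` and the toy rows are invented for illustration and are not real AGNews data.

```python
import csv
import io
from collections import Counter


def count_labels(csv_text):
    """Count how often each value of the first column occurs."""
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(row[0] for row in reader if row)


# Made-up rows in the three-column layout described above (not real data).
toy_csv = (
    '"3","Headline A","Body A"\n'
    '"3","Headline B","Body B"\n'
    '"1","Headline C","Body C"\n'
)
label_counts = count_labels(toy_csv)
print(label_counts)
```

On the real train.csv you would pass the file contents instead of `toy_csv`.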
Sentiment analysis
SST-2
Download URL
https://dl.fbaipublicfiles.com/glue/data/SST-2.zip
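Dataset description
SST stands for Stanford Sentiment Treebank; the data comes mainly from movie reviews. The dataset is split into train/dev/test sets with 67,349, 872, and 1,821 samples respectively. Each sample has 2 fields, for example: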
sentence: "for those moviegoers who complain that ` they don't
make movies like they used to anymore"
label: "0"
There are 2 labels: 0 means negative and 1 means positive.
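A minimal sketch of reading such records follows. It assumes the file is tab-separated with a `sentence<TAB>label` header row, which should be checked against the actual train and dev files. `SST2_LABELS` and `parse_sst2` are names introduced only for this example.

```python
import csv
import io

# Label meaning as described above.
SST2_LABELS = {"0": "negative", "1": "positive"}


def parse_sst2(tsv_text):
    """Parse tab-separated text with an ASSUMED `sentence<TAB>label` header row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [(row["sentence"], SST2_LABELS[row["label"]]) for row in reader]


# The sample sentence from above, joined onto one line.
tsv_sample = (
    "sentence\tlabel\n"
    "for those moviegoers who complain that ` they don't make movies "
    "like they used to anymore\t0\n"
)
sst2_examples = parse_sst2(tsv_sample)
print(sst2_examples[0][1])
```

Quoting is switched off with `csv.QUOTE_NONE` so that stray quote characters inside a sentence are kept as ordinary text.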
IMDB
Download URL
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
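Dataset description
The IMDB movie review dataset. It is split into a training set and a test set of 25,000 samples each, and each split contains 12,500 positive and 12,500 negative samples. Samples are organized into folders. Here is a positive example from the training set: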
If you like adult comedy cartoons, like South Park, then this is nearly a similar
format about the small adventures of three teenage girls at Bromwell High.
Keisha, Natella and Latrina have given exploding sweets and behaved like bitches
, I think Keisha is a good leader. There are also small stories going on with the
teachers of the school. There's the idiotic principal, Mr. Bip, the nervous Maths
teacher and many others. The cast is also fantastic, Lenny Henry's Gina Yashere,
EastEnders Chrissie Watts, Tracy-Ann Oberman, Smack The Pony's Doon Mackichan,
Dead Ringers' Mark Perry and Blunder's Nina Conti. I didn't know this came from
Canada, but it is very good. Very good!
There are two labels, pos and neg.
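Since the samples live in folders, loading a split amounts to walking the directories. The sketch below assumes each split directory (for example `aclImdb/train`) contains a `pos` and a `neg` subfolder with one `.txt` file per review. The helper `read_imdb_split` and the file name `0_1.txt` are placeholders for this illustration, and the demo builds a tiny stand-in directory rather than touching the real download.

```python
import tempfile
from pathlib import Path


def read_imdb_split(split_dir):
    """Collect (text, label) pairs from `<split_dir>/pos` and `<split_dir>/neg`."""
    examples = []
    for label in ("pos", "neg"):
        for path in sorted(Path(split_dir, label).glob("*.txt")):
            examples.append((path.read_text(encoding="utf-8"), label))
    return examples


# Build a tiny stand-in for `aclImdb/train` so the sketch runs without the download.
with tempfile.TemporaryDirectory() as tmp:
    train_dir = Path(tmp, "aclImdb", "train")
    for label, text in (("pos", "Very good!"), ("neg", "Not for me.")):
        (train_dir / label).mkdir(parents=True)
        (train_dir / label / "0_1.txt").write_text(text, encoding="utf-8")
    imdb_examples = read_imdb_split(train_dir)

print(imdb_examples)
```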
Yelp Review Full
Download URL
https://s3.amazonaws.com/fast-ai-nlp/yelp_review_full_csv.tgz
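Dataset description
Yelp is a US business review site, similar to Dianping. The dataset is split into a training set of 650,000 samples and a test set of 50,000 samples. For example: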
label: "5",
text: "dr. goldberg offers everything i look for in a general practitioner.
he's nice and easy to talk to without being patronizing; he's always on
time in seeing his patients; he's affiliated with a top-notch hospital
(nyu) which my parents have explained to me is very important in case
something happens and you need surgery; and you can get referrals to see
specialists without having to see him first. really, what more do you
need? i'm sitting here trying to think of any complaints i have about
him, but i'm really drawing a blank."
The label is the rating, from 1 star to 5 stars.
Amazon polarity
Download URL
https://s3.amazonaws.com/fast-ai-nlp/amazon_review_polarity_csv.tgz
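Dataset description
Amazon polarity is a sentiment polarity dataset built from Amazon product reviews. The original dataset, like the Yelp dataset, rates products from 1 to 5 points. This polarity dataset treats 1 and 2 points as negative reviews and 4 and 5 points as positive reviews, and ignores 3 points. At 656 MiB it is fairly large compared with common NLP datasets. The training set has [count missing] samples and the test set has [count missing] samples. For example: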
class: "2",
review title: "Stuning even for the non-gamer",
review text: "This sound track was beautiful! It paints the senery in your mind
so well I would recomend it even to people who hate vid. game music!
I have played the game Chrono Cross but out of all of the games I
have ever played it has the best music! It backs away from crude
keyboarding and takes a fresher step with grate guitars and soulful
orchestras. It would impress anyone who cares to listen! ^_^"
There are only two classes, negative and positive: the value 1 means negative and 2 means positive.
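The rule that turns star ratings into polarity classes can be written down in a few lines. This is only a sketch of the mapping described above, not the code used to build the dataset, and `stars_to_polarity` is a name chosen for this example.

```python
def stars_to_polarity(stars):
    """Map a 1-5 star rating to the polarity class, or None if it is dropped."""
    if stars in (1, 2):
        return 1  # negative
    if stars in (4, 5):
        return 2  # positive
    if stars == 3:
        return None  # 3-star reviews are ignored
    raise ValueError(f"expected a rating from 1 to 5, got {stars!r}")


polarity_labels = [stars_to_polarity(s) for s in (1, 2, 3, 4, 5)]
print(polarity_labels)
```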
Hate speech
Automated Hate Speech Detection and the Problem of Offensive Language
Download URL
GitHub repository: https://github.com/t-davidson/hate-speech-and-offensive-language
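Dataset description
This dataset comes from the ICWSM 2017 paper Automated Hate Speech Detection and the Problem of Offensive Language. The authors used the Twitter API to search for tweets containing terms from a specific lexicon. These tweets came from 33,458 users, and crawling each user's timeline yielded 85.4 million tweets in total. A random sample of 25k tweets was then labeled through crowdsourcing. The dataset has three classes: hate speech, offensive but not hate speech, or neither. Each tweet was labeled by at least three people, and the final class was chosen by majority vote.
Sample data is shown below. In each row, the first number is the id, the second is the number of annotators, the third, fourth, and fifth are the numbers of votes for each class, the sixth is the class (0 hate speech, 1 offensive language, 2 neither), and the last field is the tweet itself.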
,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. & as a man you should always take the trash out...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn bad for cuffin dat hoe in the 1st place!!
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby4life: You ever fuck a bitch and she start to cry? You be confused as shit
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she look like a tranny
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you hear about me might be true or it might be faker than the bitch who told it to ya 
5,3,1,2,0,1,"!!!!!!!!!!!!!!!!!!""@T_Madison_x: The shit just blows me..claim you so faithful and down for somebody but still fucking with hoes! 😂😂😂"""
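The majority-vote rule can be checked directly against the vote columns. The sketch below recomputes the class from the three vote counts and compares it with the `class` column. The header and numbers are copied from the sample rows above, the tweet text is replaced by a placeholder, and breaking ties in favor of the first category is an assumption of this sketch, not something the source states.

```python
import csv
import io

# Vote columns in class-index order: 0 hate speech, 1 offensive language, 2 neither.
CLASS_NAMES = ("hate_speech", "offensive_language", "neither")


def majority_class(row):
    """Return the index of the category with the most votes (first one on ties)."""
    votes = [int(row[name]) for name in CLASS_NAMES]
    return votes.index(max(votes))


# Header and vote counts from the sample above; tweet text is a placeholder.
csv_sample = (
    ",count,hate_speech,offensive_language,neither,class,tweet\n"
    "0,3,0,0,3,2,placeholder tweet\n"
    "3,3,0,2,1,1,placeholder tweet\n"
    "5,3,1,2,0,1,placeholder tweet\n"
)
rows = list(csv.DictReader(io.StringIO(csv_sample)))
agreement = [majority_class(r) == int(r["class"]) for r in rows]
print(agreement)
```

The same loop can also confirm that the three vote counts in each row add up to the `count` column.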