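I. Common English Tasks
GLUE data download: https://gluebenchmark.com/tasks
1. CoLA
1.1 Concept
CoLA (The Corpus of Linguistic Acceptability) is a single-sentence classification task in NLP. Its goal: "The CoLA task is to predict whether an English sentence is grammatically plausible."
1.2 Data
Each line of the downloaded train.tsv and dev.tsv contains 4 tab-separated ('\t') columns.
Column 1 | Source of the sentence. |
Column 2 | Acceptability label (0 = unacceptable, 1 = acceptable). |
Column 3 | Acceptability judgment originally assigned by the author. |
Column 4 | Sentence text. |
Each line of test.tsv contains 2 tab-separated columns.
Column 1 | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
Column 2 | Sentence text. |
Examples from train and dev:
clc95 is the sentence source, 0 and 1 are the labels, and * is the judgment originally assigned by the author.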
clc95 | 0 | * | They noticed the painting, but I don't know for how long. |
clc95 | 0 | * | John was tall, but I don't know on what occasions. |
clc95 | 1 |  | Joan ate dinner with someone but I don't know who. |
clc95 | 1 |  | Joan ate dinner with someone but I don't know who with. |
clc95 | 0 | * | I know that Meg's attracted to Harry, but they don't know who. |
clc95 | 0 | * | Since Jill said Joe had invited Sue, we didn't have to ask who. |
Examples from test:
index is the sentence number and sentence is the text.
index | sentence |
0 | Bill whistled past the house. |
1 | The car honked its way down the road. |
2 | Bill pushed Harry off the sofa. |
3 | the kittens yawned awake and played. |
4 | I demand that the more John eats, the more he pay. |
5 | If John eats more, keep your mouth shut tighter, OK? |
6 | His expectations are always lower than mine are. |
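The four-column train/dev layout described above can be parsed with the standard library. The following is a minimal sketch, not an official loader: the function name `read_cola` and the dictionary keys are hypothetical, and it assumes the file has no header row and no quoting, as in the samples shown.

```python
import csv
import io


def read_cola(lines):
    """Parse CoLA train/dev rows: source, label, original mark, sentence."""
    # QUOTE_NONE keeps apostrophes and quotation marks in sentences untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    examples = []
    for row in reader:
        if len(row) != 4:
            continue  # skip malformed lines instead of guessing their layout
        source, label, original_mark, sentence = row
        examples.append(
            {
                "source": source,
                "label": int(label),  # 0 = unacceptable, 1 = acceptable
                "original_mark": original_mark,  # "*" or empty
                "sentence": sentence,
            }
        )
    return examples


# Two rows copied from the examples above, joined with real tab characters.
sample = io.StringIO(
    "clc95\t0\t*\tThey noticed the painting, but I don't know for how long.\n"
    "clc95\t1\t\tJoan ate dinner with someone but I don't know who.\n"
)
cola_examples = read_cola(sample)
print(cola_examples[0]["label"], cola_examples[1]["sentence"])
```

For a real file, pass an open file object (for example `open("train.tsv", encoding="utf-8")`) instead of the in-memory sample.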
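1.3 Evaluation Metric
MCC: "The Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classifications."
MCC is essentially a correlation coefficient between the observed and the predicted binary classifications. It takes TP, TN, FP, and FN into account, so it remains a usable measure even when the positive and negative classes differ greatly in size. Its value lies in [-1, 1]: 1 means perfect prediction, 0 means no better than random prediction, and -1 means total disagreement between observation and prediction.
Formula:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
If any of the four sums in the denominator is zero, the denominator can be set to 1, which gives an MCC of zero; this can be shown to be the correct limiting value.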
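As a minimal sketch, MCC can be computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), returning 0 when any of the four sums is zero as described above. The helper names `confusion_counts` and `matthews_corrcoef` are illustrative, and the labels in the usage lines are made up.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from confusion counts; 0.0 when the denominator would be zero."""
    numerator = tp * tn - fp * fn
    denom_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom_sq == 0:
        # Same effect as setting the denominator to 1: the numerator is 0 here.
        return 0.0
    return numerator / math.sqrt(denom_sq)


# Made-up labels for illustration only.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
mcc = matthews_corrcoef(*confusion_counts(y_true, y_pred))
print(round(mcc, 4))
```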
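2. SST-2
2.1 Concept
SST-2 (The Stanford Sentiment Treebank) is a single-sentence classification task in NLP. Its goal: "The SST-2 task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative."
2.2 Data
Each line of train.tsv and dev.tsv contains 2 tab-separated ('\t') columns.
Column 1 | Review sentence. |
Column 2 | Sentiment label: 1 is positive, 0 is negative. |
Each line of test.tsv contains 2 tab-separated columns.
Column 1 | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
Column 2 | Review sentence. |
Examples from train and dev:
sentence is the review sentence and label is the sentiment label.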
sentence | label |
hide new secretions from the parental units | 0 |
contains no wit , only labored gags | 0 |
that loves its characters and communicates something rather beautiful about human nature | 1 |
remains utterly satisfied to remain the same throughout | 0 |
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up | 0 |
that 's far too tragic to merit such superficial treatment | 0 |
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an
Examples from test:
index is the sentence number and sentence is the text.
index | sentence |
0 | uneasy mishmash of styles and genres . |
1 | this film 's relationship to actual tension is the same as what christmas-tree flocking in a spray can is to actual snow : a poor -- if durable -- imitation . |
2 | by the end of no such thing the audience , like beatrice , has a watchful affection for the monster . |
3 | director rob marshall went out gunning to make a great one . |
4 | lathan and diggs have considerable personal charm , and their screen rapport makes the old story seem new . |
5 | a well-made and often lovely depiction of the mysteries of friendship . |
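2.3 Evaluation Metric
ACC: accuracy, the number of correctly predicted positive and negative examples divided by the total number of examples.
Formula:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
TP: number of positive examples predicted correctly.
FP: number of negative examples predicted incorrectly (as positive).
TN: number of negative examples predicted correctly.
FN: number of positive examples predicted incorrectly (as negative).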
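A minimal sketch of accuracy, computed both directly from label lists and from the four counts defined above as (TP + TN) / (TP + TN + FP + FN). The function names are illustrative and the labels are made up.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def accuracy_from_counts(tp, tn, fp, fn):
    """Same quantity expressed with confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("counts sum to zero")
    return (tp + tn) / total


# Made-up labels: 1 = positive review, 0 = negative review.
acc = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
print(acc)  # 3 of the 4 predictions match
```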
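3. STS-B
3.1 Concept
STS-B (Semantic Textual Similarity Benchmark) is a regression task: given a sentence pair, the model predicts a score in [0, 5] that represents the semantic similarity of the two sentences.
3.2 Data
The STS Benchmark comprises the English datasets used in the STS tasks organized within SemEval between 2012 and 2017. The selected data includes text from image captions, news headlines, and user forums.
Each line of train.tsv and dev.tsv contains 10 tab-separated ('\t') columns; the last three matter most.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
genre | One of three sources: captions, news, and forum. |
filename | File name. |
year | Year. |
old_index | Old index. |
source1 | Source of sentence 1. |
source2 | Source of sentence 2. |
sentence1 | Sentence 1. |
sentence2 | Sentence 2. |
score | Semantic similarity score. |
Examples: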
index | genre | filename | year | old_index | source1 | source2 | sentence1 | sentence2 | score |
0 | main-captions | MSRvid | 2012test | 0001 | none | none | A plane is taking off. | An air plane is taking off. | 5.000 |
1 | main-captions | MSRvid | 2012test | 0004 | none | none | A man is playing a large flute. | A man is playing a flute. | 3.800 |
2 | main-captions | MSRvid | 2012test | 0005 | none | none | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncooked pizza. | 3.800 |
3 | main-captions | MSRvid | 2012test | 0006 | none | none | Three men are playing chess. | Two men are playing chess. | 2.600 |
4 | main-captions | MSRvid | 2012test | 0009 | none | none | A man is playing the cello. | A man seated is playing the cello. | 4.250 |
5 | main-captions | MSRvid | 2012test | 0011 | none | none | Some men are fighting. | Two men are fighting. | 4.250 |
6 | main-captions | MSRvid | 2012test | 0012 | none | none | A man is smoking. | A man is skating. | 0.500 |
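Each line of test.tsv contains 9 tab-separated ('\t') columns; the last two matter most.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
genre | One of three sources: captions, news, and forum. |
filename | File name. |
year | Year. |
old_index | Old index. |
source1 | Source of sentence 1. |
source2 | Source of sentence 2. |
sentence1 | Sentence 1. |
sentence2 | Sentence 2. |
Examples: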
index | genre | filename | year | old_index | source1 | source2 | sentence1 | sentence2 |
0 | main-captions | MSRvid | 2012test | 0024 | none | none | A girl is styling her hair. | A girl is brushing her hair. |
1 | main-captions | MSRvid | 2012test | 0033 | none | none | A group of men play soccer on the beach. | A group of boys are playing soccer on the beach. |
2 | main-captions | MSRvid | 2012test | 0045 | none | none | One woman is measuring another woman's ankle. | A woman measures another woman's ankle. |
3 | main-captions | MSRvid | 2012test | 0063 | none | none | A man is cutting up a cucumber. | A man is slicing a cucumber. |
4 | main-captions | MSRvid | 2012test | 0066 | none | none | A man is playing a harp. | A man is playing a keyboard. |
5 | main-captions | MSRvid | 2012test | 0074 | none | none | A woman is cutting onions. | A woman is cutting tofu. |
6 | main-captions | MSRvid | 2012test | 0076 | none | none | A man is riding an electric bicycle. | A man is riding a bicycle. |
7 | main-captions | MSRvid | 2012test | 0082 | none | none | A man is playing the drums. | A man is playing the guitar. |
8 | main-captions | MSRvid | 2012test | 0092 | none | none | A man is playing guitar. | A lady is playing the guitar. |
9 | main-captions | MSRvid | 2012test | 0095 | none | none | A man is playing a guitar. | A man is playing a trumpet. |
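3.3 Evaluation Metric
Pearson-Spearman Corr: the name combines two methods that are computed separately, the Pearson correlation coefficient and the Spearman correlation coefficient. (Baidu's ERNIE model reports three values: the Pearson coefficient, the Spearman coefficient, and the average of the two.)
A correlation coefficient measures how strongly two variables X and Y are related; its value lies in [-1, 1].
0 means X and Y are uncorrelated. When X and Y tend to increase or decrease together, they are positively correlated and the coefficient lies in (0, 1]. When X and Y move in opposite directions, they are negatively correlated and the coefficient lies in [-1, 0). The larger the absolute value, the stronger the correlation: values near 1 or -1 indicate a strong correlation, values near 0 a weak one.
Note: computing a correlation yields two values, r and p. r is the correlation coefficient itself, i.e. the strength of the correlation. p is the significance test value: p < 0.05 is conventionally treated as significant, meaning the correlation observed in the current sample is statistically meaningful. Looking at r alone can mislead, because the observed correlation may be due to chance.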
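The two coefficients and their average can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code used by GLUE or ERNIE: Spearman is computed as Pearson on average ranks (ties share the mean of their positions), p-values are not computed, and the prediction values are hypothetical. The gold scores are the first five scores from the train example above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank sequences."""
    return pearson(average_ranks(x), average_ranks(y))


def sts_b_scores(gold, predicted):
    """Pearson, Spearman, and their mean, as the text says ERNIE reports."""
    p = pearson(gold, predicted)
    s = spearman(gold, predicted)
    return {"pearson": p, "spearman": s, "mean": (p + s) / 2}


gold = [5.0, 3.8, 3.8, 2.6, 4.25]      # scores from the train example
predicted = [4.8, 3.5, 4.0, 2.0, 4.1]  # hypothetical model outputs
print(sts_b_scores(gold, predicted))
```

If SciPy is available, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` return both the coefficient and the p-value discussed above.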
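4. MRPC
4.1 Concept
MRPC (Microsoft Research Paraphrase Corpus) is a sentence-pair classification task: "The task is to predict whether each pair captures a paraphrase/semantic equivalence." That is, given a sentence pair, decide whether the two sentences are semantically equivalent.
4.2 Data
The data contains 5,800 sentence pairs extracted from online news sources, each annotated with whether the pair is a paraphrase, i.e. semantically equivalent.
Each line of train.tsv and dev.tsv contains 5 tab-separated ('\t') columns.
Quality | Label: 1 if the pair is semantically equivalent, otherwise 0. |
#1 ID | ID of the first sentence. |
#2 ID | ID of the second sentence. |
#1 String | Text of the first sentence. |
#2 String | Text of the second sentence. |
Examples: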
Quality #1 ID #2 ID #1 String #2 String 1 702876 702977 Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence . Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence . 0 2108705 2108831 Yucaipa owned Dominick 's before selling the chain to Safeway in 1998 for $ 2.5 billion . Yucaipa bought Dominick 's in 1995 for $ 693 million and sold it to Safeway for $ 1.8 billion in 1998 . 1 1330381 1330521 They had published an advertisement on the Internet on June 10 , offering the cargo for sale , he added . On June 10 , the ship 's owners had published an advertisement on the Internet , offering the explosives for sale . 0 3344667 3344648 Around 0335 GMT , Tab shares were up 19 cents , or 4.4 % , at A $ 4.56 , having earlier set a record high of A $ 4.57 . Tab shares jumped 20 cents , or 4.6 % , to set a record closing high at A $ 4.57 . 1 1236820 1236712 The stock rose $ 2.11 , or about 11 percent , to close Friday at $ 21.51 on the New York Stock Exchange . PG & E Corp. shares jumped $ 1.63 or 8 percent to $ 21.03 on the New York Stock Exchange on Friday .
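Each line of test.tsv contains 5 tab-separated ('\t') columns.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
#1 ID | ID of the first sentence. |
#2 ID | ID of the second sentence. |
#1 String | Text of the first sentence. |
#2 String | Text of the second sentence. |
Examples: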
index #1 ID #2 ID #1 String #2 String 0 1089874 1089925 PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So . Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So . 1 3019446 3019327 The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected . Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash . 2 1945605 1945824 According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 . The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 . 3 1430402 1430329 A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night . A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night . 4 3354381 3354396 The company didn 't detail the costs of the replacement and repairs . But company officials expect the costs of the replacement work to run into the millions of dollars . 5 1390995 1391183 The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added . Under the agreement , the settling companies will also assign their potential claims against the underwriters to the investors , he added .
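4.3 Evaluation Metric
1) ACC
2) F1: the harmonic mean of precision and recall, which takes both into account. A high F1 score means precision and recall are both high and well balanced.
Formula:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where:
$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$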
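A minimal sketch of F1 as the harmonic mean of precision and recall, with precision = TP / (TP + FP) and recall = TP / (TP + FN). Returning 0.0 when a denominator is zero is an assumption made here for convenience, not something the text specifies; the function name and the counts are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from counts for the positive class."""
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(precision, recall, round(f1, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, the F1 here lies closer to the recall of 0.5 than to the precision of 0.8.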
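5. QQP
5.1 Concept
QQP (Quora Question Pairs) is, like MRPC, a sentence-pair classification task: it checks whether a pair of texts actually corresponds to semantically equivalent queries.
5.2 Data
The dataset was released for a variety of Quora-related problems and contains over 400,000 lines of potential duplicate question pairs. Each line contains the ID of each question in the pair, the full text of each question, and a binary value indicating whether the line really contains a duplicate pair.
Each line of train.tsv and dev.tsv contains 6 tab-separated ('\t') columns.
id | Row ID. |
qid1 | ID of the first question. |
qid2 | ID of the second question. |
question1 | Text of the first question. |
question2 | Text of the second question. |
is_duplicate | Label: 1 if the two questions are duplicates, otherwise 0. |
Examples: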
id qid1 qid2 question1 question2 is_duplicate 133273 213221 213222 How is the life of a math student? Could you describe your own experiences? Which level of prepration is enough for the exam jlpt5? 0 402555 536040 536041 How do I control my horny emotions? How do you control your horniness? 1 360472 364011 490273 What causes stool color to change to yellow? What can cause stool to come out as little balls? 0 150662 155721 7256 What can one do after MBBS? What do i do after my MBBS ? 1 183004 279958 279959 Where can I find a power outlet for my laptop at Melbourne Airport? Would a second airport in Sydney, Australia be needed if a high-speed rail link was created between Melbourne and Sydney? 0 119056 193387 193388 How not to feel guilty since I am Muslim and I'm conscious we won't have sex together? I don't beleive I am bulimic, but I force throw up atleast once a day after I eat something and feel guilty. Should I tell somebody, and if so who? 0
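Each line of test.tsv contains 3 tab-separated ('\t') columns.
id | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
question1 | Text of question 1. |
question2 | Text of question 2. |
Examples: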
id question1 question2 0 Would the idea of Trump and Putin in bed together scare you, given the geopolitical implications? Do you think that if Donald Trump were elected President, he would be able to restore relations with Putin and Russia as he said he could, based on the rocky relationship Putin had with Obama and Bush? 1 What are the top ten Consumer-to-Consumer E-commerce online? What are the top ten Consumer-to-Business E-commerce online? 2 Why don't people simply 'Google' instead of asking questions on Quora? Why do people ask Quora questions instead of just searching google? 3 Is it safe to invest in social trade biz? Is social trade geniune? 4 If the universe is expanding then does matter also expand? If universe and space is expanding? Does that mean anything that occupies space is also expanding? 5 What is the plural of hypothesis? What is the plural of thesis?
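5.3 Evaluation Metric
1) ACC
2) F1
6. MNLI
6.1 Concept
MNLI (Multi-Genre Natural Language Inference) is a natural language inference task "where the goal is to predict whether a sentence is an entailment, contradiction, or neutral with respect to the other." That is, predict whether the relation between two sentences is entailment (one implies the other), contradiction, or neutral.
6.2 Data
The MultiNLI corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. It is modeled on the SNLI corpus but differs in that it covers a range of genres of spoken and written text and supports a distinctive cross-genre generalization evaluation. The MNLI dev and test sets each come in two variants, matched and mismatched: train on train.tsv directly, then evaluate on both the matched and the mismatched sets for validation and testing.
Each line of train.tsv contains 12 tab-separated ('\t') columns; each line of dev_matched.tsv and dev_mismatched.tsv contains 16. For both train and dev, the columns of interest are 0, 8, 9, and the last one.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
sentence1 | First sentence. |
sentence2 | Second sentence. |
gold_label | One of entailment, contradiction, neutral. |
Each line of test_matched.tsv and test_mismatched.tsv contains 10 tab-separated ('\t') columns. The columns of interest are 0, 8, and 9.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
sentence1 | First sentence. |
sentence2 | Second sentence. |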
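Picking out the columns of interest (0, 8, 9, and, for train and dev, the last one) can be sketched as follows. The function name `pick_mnli_fields` is hypothetical, and the rows below are synthetic placeholders with the stated column counts, not real MNLI data.

```python
def pick_mnli_fields(line, has_label=True):
    """Keep only the columns of interest from one MNLI TSV row."""
    cols = line.rstrip("\n").split("\t")
    example = {
        "index": cols[0],
        "sentence1": cols[8],
        "sentence2": cols[9],
    }
    if has_label:
        # The gold label is the last column in both train (12) and dev (16).
        example["gold_label"] = cols[-1]
    return example


# Synthetic 12-column row standing in for a train.tsv line (not real data).
train_row = "\t".join(
    ["0"] + ["-"] * 7 + ["premise text", "hypothesis text", "-", "entailment"]
)
train_example = pick_mnli_fields(train_row)

# Synthetic 10-column row standing in for a test file line (not real data).
test_row = "\t".join(["0"] + ["-"] * 7 + ["premise text", "hypothesis text"])
test_example = pick_mnli_fields(test_row, has_label=False)

print(train_example)
print(test_example)
```

When reading a real file, skip the header row first, since it would otherwise be parsed as an example.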
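6.3 Evaluation Metric
ACC
7. QNLI
7.1 Concept
QNLI (Question Natural Language Inference) is a binary classification task: "The task involves assessing whether a sentence contains the correct answer to a given query." That is, decide whether the sentence contains the answer to the question.
7.2 Data
The data comes from the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.
Each line of train.tsv and dev.tsv contains 4 tab-separated ('\t') columns.
index | Dataset index. |
question | Question. |
sentence | Candidate answer sentence. |
label | entailment if the sentence answers the question, otherwise not_entailment. |
Examples: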
index question sentence label 0 What came into force after the new constitution was herald? As of that day, the new constitution heralding the Second Republic came into force. entailment 1 What is the first major city in the stream of the Rhine? The most important tributaries in this area are the Ill below of Strasbourg, the Neckar in Mannheim and the Main across from Mainz. not_entailment 2 What is the minimum required if you want to teach in Canada? In most provinces a second Bachelor's Degree such as a Bachelor of Education is required to become a qualified teacher. not_entailment 3 How was Temüjin kept imprisoned by the Tayichi'ud? The Tayichi'ud enslaved Temüjin (reportedly with a cangue, a sort of portable stocks), but with the help of a sympathetic guard, the father of Chilaun (who later became a general of Genghis Khan), he was able to escape from the ger (yurt) in the middle of the night by hiding in a river crevice.[citation needed] entailment 4 What did Herr Gott, dich loben wir become known as ? He paraphrased the Te Deum as "Herr Gott, dich loben wir" with a simplified form of the melody. not_entailment
Each line of test.tsv contains 3 tab-separated ('\t') columns.
index | Dataset index. |
question | Question. |
sentence | Candidate answer sentence. |
Examples:
index question sentence 0 What organization is devoted to Jihad against Israel? For some decades prior to the First Palestine Intifada in 1987, the Muslim Brotherhood in Palestine took a "quiescent" stance towards Israel, focusing on preaching, education and social services, and benefiting from Israel's "indulgence" to build up a network of mosques and charitable organizations. 1 In what century was the Yarrow-Schlick-Tweedy balancing system used? In the late 19th century, the Yarrow-Schlick-Tweedy balancing 'system' was used on some marine triple expansion engines. 2 The largest brand of what store in the UK is located in Kingston Park? Close to Newcastle, the largest indoor shopping centre in Europe, the MetroCentre, is located in Gateshead. 3 What does the IPCC rely on for research? In principle, this means that any significant new evidence or events that change our understanding of climate science between this deadline and publication of an IPCC report cannot be included. 4 What is the principle about relating spin and space variables? Thus in the case of two fermions there is a strictly negative correlation between spatial and spin variables, whereas for two bosons (e.g. quanta of electromagnetic waves, photons) the correlation is strictly positive.
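7.3 Evaluation Metric
ACC
8. RTE
8.1 Concept
RTE (Recognizing Textual Entailment) is a classification task similar to MNLI: "This task requires to recognize, given two text fragments, whether the meaning of one text is entailed (can be inferred) from the other text."
8.2 Data
RTE is a generic task that captures major semantic inference needs across many NLP applications, such as question answering, information retrieval, information extraction, and text summarization.
Each line of train.tsv and dev.tsv contains 4 tab-separated ('\t') columns.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
sentence1 | First sentence. |
sentence2 | Second sentence. |
label | One of entailment, not_entailment. |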
Examples:
index | sentence1 | sentence2 | label |
0 | Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation. | Christopher Reeve had an accident. | not_entailment |
1 | Yet, we now are discovering that antibiotics are losing their effectiveness against illness. Disease-causing bacteria are mutating faster than we can come up with new antibiotics to fight the new variations. | Bacteria is winning the war against antibiotics. | entailment |
2 | Cairo is now home to some 15 million people - a burgeoning population that produces approximately 10,000 tonnes of rubbish per day, putting an enormous strain on public services. In the past 10 years, the government has tried hard to encourage private investment in the refuse sector, but some estimate 4,000 tonnes of waste is left behind every day, festering in the heat as it waits for someone to clear it up. It is often the people in the poorest neighbourhoods that are worst affected. But in some areas they are fighting back. In Shubra, one of the northern districts of the city, the residents have taken to the streets armed with dustpans and brushes to clean up public areas which have been used as public dumps. | 15 million tonnes of rubbish are produced daily in Cairo. | not_entailment |
3 | The Amish community in Pennsylvania, which numbers about 55,000, lives an agrarian lifestyle, shunning technological advances like electricity and automobiles. And many say their insular lifestyle gives them a sense that they are protected from the violence of American society. But as residents gathered near the school, some wearing traditional garb and arriving in horse-drawn buggies, they said that sense of safety had been shattered. "If someone snaps and wants to do something stupid, there's no distance that's going to stop them," said Jake King, 56, an Amish lantern maker who knew several families whose children had been shot. | Pennsylvania has the biggest Amish community in the U.S. | not_entailment |
4 | Security forces were on high alert after an election campaign in which more than 1,000 people, including seven election candidates, have been killed. | Security forces were on high alert after a campaign marred by violence. | entailment |
5 | In 1979, the leaders signed the Egypt-Israel peace treaty on the White House lawn. Both President Begin and Sadat received the Nobel Peace Prize for their work. The two nations have enjoyed peaceful relations to this day. | The Israel-Egypt Peace Agreement was signed in 1979. | entailment |
Each line of test.tsv contains 3 tab-separated ('\t') columns.
index | Dataset index. |
sentence1 | Sentence 1. |
sentence2 | Sentence 2. |
Examples:
index sentence1 sentence2 0 Mangla was summoned after Madhumita's sister Nidhi Shukla, who was the first witness in the case. Shukla is related to Mangla. 1 Authorities in Brazil say that more than 200 people are being held hostage in a prison in the country's remote, Amazonian-jungle state of Rondonia. Authorities in Brazil hold 200 people as hostage. 2 A mercenary group faithful to the warmongering policy of former Somozist colonel Enrique Bermudez attacked an IFA truck belonging to the interior ministry at 0900 on 26 March in El Jicote, wounded and killed an interior ministry worker and wounded five others. An interior ministry worker was killed by a mercenary group. 3 The British ambassador to Egypt, Derek Plumbly, told Reuters on Monday that authorities had compiled the list of 10 based on lists from tour companies and from families whose relatives have not been in contact since the bombings. Derek Plumbly resides in Egypt. 4 Tibone estimated diamond production at four mines operated by Debswana -- Botswana's 50-50 joint venture with De Beers -- could reach 33 million carats this year. Botswana is a business partner of De Beers. 5 His wife Strida won a seat in parliament after forging an alliance with the main anti-Syrian coalition in the recent election. Strida elected to parliament.
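8.3 Evaluation Metric
ACC
9. WNLI
9.1 Concept
WNLI (Winograd NLI) is a binary classification task: decide whether two sentences have the same meaning.
9.2 Data
A Winograd schema is a pair of sentences that differ in only one or two words and that may contain an ambiguity.
For example:
1) "The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood." / "The police were trying to stop the drug trade in the neighborhood." Both say that the police were stopping the drug trade, so label=1.
2) "Steve follows Fred's example in everything. He influences him hugely." / "Steve influences him hugely." The first text may mean that Fred influences Steve hugely, which differs from the second sentence; the pair is ambiguous, so label=0.
Each line of train.tsv and dev.tsv contains 4 tab-separated ('\t') columns.
index | Sample index, counting from 0: 0 is the first sample, 1 the second, and so on. |
sentence1 | First sentence. |
sentence2 | Second sentence. |
label | 1 if the two sentences mean the same thing, otherwise 0. |
Examples: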
index | sentence1 | sentence2 | label |
0 | I stuck a pin through a carrot. When I pulled the pin out, it had a hole. | The carrot had a hole. | 1 |
1 | John couldn't see the stage with Billy in front of him because he is so short. | John is so short. | 1 |
2 | The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood. | The police were trying to stop the drug trade in the neighborhood. | 1 |
3 | Steve follows Fred's example in everything. He influences him hugely. | Steve influences him hugely. | 0 |
4 | When Tatyana reached the cabin, her mother was sleeping. She was careful not to disturb her, undressing and climbing back into her berth. | mother was careful not to disturb her, undressing and climbing back into her berth. | 0 |
5 | George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it. | George was particularly eager to see it. | 0 |
Each line of test.tsv contains 3 tab-separated ('\t') columns.
index | Dataset index. |
sentence1 | Sentence 1. |
sentence2 | Sentence 2. |
Examples:
index | sentence1 | sentence2 |
0 | Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. | Horses ran away when Maude and Dora came in sight. |
1 | Maude and Dora had seen the trains rushing across the prairie, with long, rolling puffs of black smoke streaming back from the engine. Their roars and their wild, clear whistles could be heard from far away. Horses ran away when they came in sight. | Horses ran away when the trains came in sight. |
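9.3 Evaluation Metric
ACC
10. Diagnostics Main
10.1 Concept
Diagnostics Main is a classification task. The downloaded dataset is unlabeled. The task is closest to MultiNLI: when submitting results, run your model's MultiNLI predictor on the diagnostic data. A labeled version of the dataset is also available on the official site.
10.2 Data
The data consists of several hundred sentence pairs labeled with their entailment relation (entailment, contradiction, or neutral) in both directions and tagged with a set of linguistic phenomena involved in justifying the entailment labels. It was constructed manually by the GLUE authors and draws text from several different sources, including news, academic and encyclopedic text, and social media. The pairs are deliberately built so that the two sentences in each pair are very similar, which makes the problem harder for systems that rely on simple lexical cues and statistics.
Each line of the dataset contains 3 tab-separated ('\t') columns.
index | Dataset index. |
sentence1 | Sentence 1. |
sentence2 | Sentence 2. |
Examples: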
index | sentence1 | sentence2 |
0 | The cat sat on the mat. | The cat did not sit on the mat. |
1 | The cat did not sit on the mat. | The cat sat on the mat. |
2 | When you've got no snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow. | When you've got snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow. |
3 | When you've got snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow. | When you've got no snow, it's really hard to learn a snow sport so we looked at all the different ways I could mimic being on snow without actually being on snow. |
10.3 Evaluation Metric
MCC