Relation Extraction -- A Brief Overview of Evaluation Datasets

Commonly Used Datasets

ACE 2005: 599 documents, 7 relation types.
SemEval 2010 Task 8 dataset:
- 19 relation types
- training data: 8,000 sentences
- test data: 2,717 sentences
NYT+Freebase, built by distant supervision, so it contains noisy labels:
- 53 relation types
- training data: 522,611 sentences; note that nearly 80% of them are labeled NA
- test data: 172,448 sentences
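
Since NA dominates the NYT+Freebase training labels, it can be worth checking the label distribution before training or evaluating. Below is a minimal sketch of such a check; the function name label_distribution and the toy labels are made up for illustration and are not part of any dataset release.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: fraction} for an iterable of relation labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Toy labels invented for illustration only (not taken from the dataset).
toy_labels = ["NA", "NA", "NA", "NA", "/location/location/contains"]
dist = label_distribution(toy_labels)
na_fraction = dist.get("NA", 0.0)
print(f"NA fraction: {na_fraction:.2f}")
```

On the real training file, the same function would be fed the relation column of every sentence.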

Work on relation extraction can be grouped by learning method:

Fully Supervised Learning
Distant Supervised Learning
Joint Learning of Entities and Relations
Tree Based Methods

Of these:

　　Fully supervised methods are usually evaluated on the SemEval 2010 Task 8 dataset, whose labels are fully accurate.

　　Format:

　　　　The <e1>microphone</e1> converts sound into an electrical <e2>signal</e2>.
　　　　Cause-Effect(e1,e2)
　　　　Comment:

　　　　The first line is the sentence, the second is the relation between the two entities, and the third is a comment.
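
The three-line record described above can be parsed with a few regular expressions. This is only a sketch under the layout shown here: the function name parse_semeval_record and the returned dictionary keys are hypothetical, and any extra decoration a raw file might carry (a leading sentence id or surrounding quotes, for instance) is not handled.

```python
import re

# Entity tags in the sentence line, and a relation line such as "Cause-Effect(e1,e2)".
E1_RE = re.compile(r"<e1>(.*?)</e1>")
E2_RE = re.compile(r"<e2>(.*?)</e2>")
REL_RE = re.compile(r"^(?P<name>[^()]+)\((?P<first>e[12]),(?P<second>e[12])\)$")

def parse_semeval_record(sentence_line, relation_line, comment_line=""):
    """Parse one sentence / relation / comment triple into a dict."""
    e1 = E1_RE.search(sentence_line)
    e2 = E2_RE.search(sentence_line)
    if e1 is None or e2 is None:
        raise ValueError("sentence must contain <e1> and <e2> tags")

    relation_line = relation_line.strip()
    match = REL_RE.match(relation_line)
    if match:
        relation = match.group("name")
        direction = (match.group("first"), match.group("second"))
    else:
        # A label without an (e1,e2) suffix is kept as-is, with no direction.
        relation, direction = relation_line, None

    comment = comment_line.strip()
    if comment.startswith("Comment:"):
        comment = comment[len("Comment:"):].strip()

    return {
        "sentence": re.sub(r"</?e[12]>", "", sentence_line).strip(),
        "e1": e1.group(1),
        "e2": e2.group(1),
        "relation": relation,
        "direction": direction,
        "comment": comment,
    }

record = parse_semeval_record(
    "The <e1>microphone</e1> converts sound into an electrical <e2>signal</e2>.",
    "Cause-Effect(e1,e2)",
    "Comment:",
)
print(record["e1"], record["relation"], record["e2"])
```

Keeping the direction separate from the relation name makes it easy to evaluate with or without directionality.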

　　Distantly supervised methods use the NYT+Freebase dataset. A sample from the NYT training data:

　　　　m.0ccvx　　m.05gf08　　queens　　belle_harbor　　/location/location/contains　　.....officials yesterday to reopen their investigation into the fatal crash of a passenger jet in belle_harbor , queens......　　###END###

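　　　　There are 6 columns: the first two are the Freebase MIDs of the two entities, the third and fourth are the two entity strings as they appear in the sentence, the fifth is the relation, and the last is the sentence (abridged here), terminated by ###END###.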
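
The six-column layout described above could be read roughly as follows. This is a sketch, not the loader used by any particular paper: it assumes the columns are tab-separated in the raw file (the separator is not stated here, so it is exposed as the sep parameter), and the names NytInstance and parse_nyt_line are hypothetical.

```python
from typing import NamedTuple

END_MARKER = "###END###"

class NytInstance(NamedTuple):
    e1_mid: str    # Freebase MID of the first entity
    e2_mid: str    # Freebase MID of the second entity
    e1_text: str   # first entity string as it appears in the sentence
    e2_text: str   # second entity string as it appears in the sentence
    relation: str
    sentence: str

def parse_nyt_line(line, sep="\t"):
    """Split one line into its 6 columns and drop the trailing end marker."""
    # Split at most 5 times so the sentence stays in one piece.
    fields = line.rstrip("\n").split(sep, 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    sentence = fields[5].strip()
    if sentence.endswith(END_MARKER):
        sentence = sentence[: -len(END_MARKER)].strip()
    return NytInstance(*(f.strip() for f in fields[:5]), sentence)

# The sentence is a fragment of the abridged sample above; tabs are assumed.
sample = "\t".join([
    "m.0ccvx", "m.05gf08", "queens", "belle_harbor",
    "/location/location/contains",
    "the fatal crash of a passenger jet in belle_harbor , queens",
    "###END###",
])
instance = parse_nyt_line(sample)
print(instance.relation, "|", instance.sentence)
```

If the file turns out to use a different delimiter, only the sep argument needs to change.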

These two datasets are the most widely used.

　　Two versions of the NYT dataset are in common use:

　　　　　27 relation types: the filtered dataset used by Zeng2015, Ji2017, and others. It is relatively small and is denoted SMALL.

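　　　　　53 relation types: the dataset released by Lin2016. It is relatively large, with roughly 4 times as much training data as the small version, and is denoted LARGE.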

posted @ 2019-11-01 16:51 _Meditation
