data

1. SODA

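Extract triples: take triples from a commonsense knowledge graph, in the form (head = event 1, relation = the relation between the events, tail = event 2). For example, Head: PersonX goes to an amusement park, Relation: xIntent, Tail: have fun riding attractions.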

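Turn the triple into a sentence: substitute the placeholder symbols, for example replace PersonX with a person's name (David in the examples that follow). The triple being converted: PersonX goes to an amusement park, xIntent, have fun riding attractions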
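The placeholder substitution can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fill_person` is hypothetical, and the step that turns the relation (xIntent) into natural language is left out because these notes do not give its templates.

```python
def fill_person(triple, name):
    """Replace the PersonX placeholder in the head and tail of a triple."""
    head, relation, tail = triple
    return (head.replace("PersonX", name), relation, tail.replace("PersonX", name))


triple = ("PersonX goes to an amusement park", "xIntent", "have fun riding attractions")
filled = fill_person(triple, "David")
print(filled[0])
```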

The PLM paraphrases the sentence and adds detail to produce the Narrative: David goes to an amusement park and has a blast riding the bumper cars, the Ferris wheel, and the roller coaster. He loves feeling the wind in his hair and screaming at the top of his lungs.

For a Narrative with only one participant, a second speaker is introduced by prompting the PLM with the Narrative: "[narrative] following is a conversation in the scene between [PersonX's name] and ..."

The conversation is then generated by prompting the PLM with the Narrative plus the new speaker: "[narrative] The following is a long in-depth conversation happening in the scene between David and his friend Sarah with multiple turns.\nDavid:"
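The prompt for the conversation step can be assembled with plain string formatting. The sketch below only builds the prompt text quoted above; the function name `dialogue_prompt` is hypothetical, and the call to the PLM itself is not shown.

```python
def dialogue_prompt(narrative, person_x, partner):
    """Build the conversation prompt; the trailing speaker tag cues the PLM's first turn."""
    return (
        f"{narrative} The following is a long in-depth conversation happening "
        f"in the scene between {person_x} and {partner} with multiple turns.\n"
        f"{person_x}:"
    )


narrative = "David goes to an amusement park and has a blast riding the bumper cars."
prompt = dialogue_prompt(narrative, "David", "his friend Sarah")
print(prompt)
```

Ending the prompt with the first speaker's tag means the PLM continues directly with that speaker's opening utterance.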

At this point the preliminary dataset has been generated.

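Dialogue filter 1: use pattern matching to drop dialogues that contain repeated utterances, lack speaker labels, have fewer than 4 turns or more than 20 turns, involve more than two speakers, or read like a robot talking.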
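A rough sketch of this rule-based filter is shown below. It assumes each turn sits on its own line as "Speaker: utterance", which the notes do not specify, and the function names are hypothetical. The robot-style check is omitted because no concrete pattern is given for it.

```python
def parse_turns(dialogue):
    """Split a dialogue into (speaker, utterance) pairs; None if a speaker label is missing."""
    turns = []
    for line in dialogue.strip().splitlines():
        if not line.strip():
            continue  # ignore blank lines between turns
        speaker, sep, utterance = line.partition(":")
        if not sep or not speaker.strip():
            return None  # no "Speaker:" prefix on this line
        turns.append((speaker.strip(), utterance.strip()))
    return turns


def keep_dialogue(dialogue, min_turns=4, max_turns=20):
    """Return True only if the dialogue passes every pattern-based check."""
    turns = parse_turns(dialogue)
    if turns is None:
        return False
    if not (min_turns <= len(turns) <= max_turns):
        return False
    if len({speaker for speaker, _ in turns}) > 2:
        return False
    utterances = [utterance.lower() for _, utterance in turns]
    if len(set(utterances)) < len(utterances):
        return False  # some utterance is repeated verbatim
    return True


good = "David: Hi!\nSarah: Hello.\nDavid: Want to ride?\nSarah: Sure, let's go."
print(keep_dialogue(good))
```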

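Dialogue filter 2: the Canary model filters out dialogues that would require human intervention, and the Rewire API filters out violent dialogues.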

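Dialogue filter 3:

Use the PLM to check whether the conversation reflects the knowledge in the original triple: [conversation]\n Q: [relation-tail question]\n A: . For example, the relation-tail question could be "Did David intend to have fun riding attractions?". Each relation has its own relation-tail question template.

Use the PLM to check whether the Narrative reflects the knowledge in the original triple: [narrative]\n Q: [head question]\n A: . For example, the head question could be "David goes to an amusement park, is this true?"

If the PLM's answer for the Narrative is wrong, the sample is filtered out.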
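The two verification prompts can be sketched as below. Only the xIntent template is included, reconstructed from the single example in these notes; the dictionary `RELATION_QUESTIONS`, the function names, and the exact whitespace around Q: and A: are assumptions.

```python
# Hypothetical template table; only xIntent is grounded in the example above.
RELATION_QUESTIONS = {
    "xIntent": "Did {name} intend to {tail}?",
}


def conversation_check_prompt(conversation, relation, name, tail):
    """Ask whether the conversation reflects the relation and tail of the triple."""
    question = RELATION_QUESTIONS[relation].format(name=name, tail=tail)
    return f"{conversation}\nQ: {question}\nA:"


def narrative_check_prompt(narrative, head_sentence):
    """Ask whether the narrative reflects the head event of the triple."""
    return f"{narrative}\nQ: {head_sentence}, is this true?\nA:"


print(narrative_check_prompt("[narrative]", "David goes to an amusement park"))
```

The PLM's completion after "A:" would then be read as a yes or no answer, and samples whose answer does not confirm the triple would be dropped.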

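Name bias handling: some names may be over-represented in the generated dialogues, so every name in a dialogue is replaced with one drawn at random from the top 10k names in a name database.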
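A minimal sketch of the name replacement, assuming the speaker names of each dialogue are already known. The function name `replace_names` is hypothetical, and the three-name pool stands in for the 10k-name list.

```python
import random
import re


def replace_names(dialogue, current_names, name_pool, rng):
    """Swap every occurrence of each current name for a distinct random name."""
    new_names = rng.sample(name_pool, len(current_names))
    mapping = dict(zip(current_names, new_names))
    # One pass over the text, so a newly inserted name is never replaced again.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, current_names)) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], dialogue)


rng = random.Random(0)  # seeded only to make the sketch reproducible
pool = ["Alice", "Bob", "Carol"]  # stand-in for the 10k-name list
renamed = replace_names("David: Hi Sarah!\nSarah: Hi David!", ["David", "Sarah"], pool, rng)
print(renamed)
```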

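This yields the final dataset, SODA. The construction method described above is called CO3.

The authors trained a dialogue model on SODA, called COSMO.

2. UltraChat

It contains three sub-datasets: Questions about the World (open-domain dialogues on a wide range of topics), Writing and Creation (creating text from scratch), and Assistance on Existent Materials (generation based on existing material).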

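Building Questions about the World: 30 general meta topics are defined, and 1100+ subtopics are generated under these meta topics. For each subtopic, 10 questions are produced, and the ChatGPT API is used to generate more diverse related questions from those 10. Questions are then sampled, and for each sampled question two ChatGPT API instances generate a dialogue of 3 to 7 rounds, one playing the user and the other producing the replies. A prompt instructs the ChatGPT instance playing the user to imitate human behavior as closely as possible. Finally, the dialogues are post-processed.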
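The two-model loop can be sketched with the models passed in as plain callables. This is an illustration under assumptions: `simulate_dialogue` is a hypothetical name, a round is taken to mean one user message plus one reply, and the stub functions stand in for real ChatGPT API calls.

```python
def simulate_dialogue(opening_question, user_model, assistant_model, num_rounds):
    """Alternate between a user simulator and a responder for num_rounds rounds."""
    history = [("user", opening_question)]
    for round_index in range(num_rounds):
        history.append(("assistant", assistant_model(history)))
        if round_index < num_rounds - 1:
            # The user simulator writes the next question from the history so far.
            history.append(("user", user_model(history)))
    return history


def stub_user(history):
    # Placeholder for the ChatGPT instance prompted to behave like a human user.
    return f"follow-up question {len(history) // 2}"


def stub_assistant(history):
    # Placeholder for the ChatGPT instance that answers the latest user message.
    return f"answer to: {history[-1][1]}"


dialogue = simulate_dialogue("What causes tides?", stub_user, stub_assistant, num_rounds=3)
for role, text in dialogue:
    print(f"{role}: {text}")
```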

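Building Writing and Creation: 20 writing types are defined, and 200 different creation instructions are designed for each type. 80% of the instructions are further refined to ask for more detail. ChatGPT then generates the dialogues from these instructions.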

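Building Assistance on Existent Materials: 100,000 pieces of material are taken from the C4 dataset, 5 questions are generated for each piece, and each piece of material is combined with its questions to generate dialogues.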
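Pairing each material with its questions could look like the sketch below. How the material and question are actually joined is not stated in these notes, so the blank-line separator and the function name `material_prompts` are assumptions.

```python
def material_prompts(material, questions):
    """Pair one material with each of its questions as a dialogue-opening prompt."""
    return [f"{material}\n\n{question}" for question in questions]


material = "Some passage taken from C4."
questions = [f"Question {i} about the passage?" for i in range(1, 6)]
prompts = material_prompts(material, questions)
print(len(prompts))
```

With 100,000 materials and 5 questions each, this scheme gives 500,000 opening prompts.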

posted @ 2023-04-23 16:23 Jary霸
