Large datasets are essential for neural modeling of many NLP tasks. Current publicly available open-domain dialogue datasets offer a trade-off between quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap by building a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We extract and process dialogues from public-domain books made available by Project Gutenberg. We describe our dialogue extraction pipeline, analyze the effects of the various heuristics used, and present an error analysis of extracted dialogues. Finally, we conduct experiments showing that better response quality can be achieved in zero-shot and finetuning settings by training on our data than on the larger but much noisier Opensubtitles dataset. Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can be extended to further languages with little additional effort. Researchers can also build their versions of existing datasets by adjusting various trade-off parameters. We also built a web demo for interacting with our models: https://ricsinaruto.github.io/chatbot.html.
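To give a rough feel for the kind of quote-based heuristic the extraction pipeline relies on, here is a toy Python sketch. It is not the paper's actual method or the gutenberg-dialog repository's API: the function name, regex, and `max_gap` threshold are hypothetical, and the real pipeline applies additional language-specific delimiters and filtering steps described in the paper.

```python
import re

# Toy illustration only: names and thresholds are hypothetical,
# not taken from the gutenberg-dialog repository.
QUOTE_RE = re.compile(r'["\u201c](.+?)["\u201d]')  # text between straight or curly double quotes

def extract_dialogues(book_text, max_gap=1):
    """Group quoted utterances from consecutive paragraphs into dialogues.

    A new dialogue starts whenever more than `max_gap` paragraphs in a row
    contain no quoted speech (a crude stand-in for scene changes).
    """
    dialogues, current, gap = [], [], 0
    for paragraph in book_text.split("\n\n"):
        utterances = QUOTE_RE.findall(paragraph)
        if utterances:
            current.extend(u.strip() for u in utterances)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                if len(current) > 1:
                    dialogues.append(current)
                current = []
    if len(current) > 1:
        dialogues.append(current)
    return dialogues
```

In this sketch, single quoted utterances with no neighbors are discarded, mirroring the general idea that isolated quotes rarely form usable dialogue turns; the trade-off parameters mentioned in the abstract (e.g., how aggressively to split or filter) would correspond to knobs like `max_gap` here.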