This paper introduces Doc2Bot, a novel dataset for building machines that help users seek information via conversations. This is of particular interest to companies and organizations that own large collections of manuals or instruction books. Despite its potential, the task poses two challenges: (1) documents contain varied structures that are hard for machines to comprehend, and (2) user information needs are often underspecified. Unlike prior datasets, which either focus on a single structural type or overlook the role of questioning in uncovering user needs, Doc2Bot is designed to address both challenges systematically. Our dataset contains over 100,000 turns based on Chinese documents from five domains, making it larger than any prior document-grounded dialog dataset for information seeking. We propose three tasks in Doc2Bot: (1) dialog state tracking, to track user intentions; (2) dialog policy learning, to plan system actions and contents; and (3) response generation, which produces responses from the outputs of the dialog policy. We present baseline methods built on recent deep learning models; their results indicate that the proposed tasks are challenging and merit further research.
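To make the relationship between the three tasks concrete, the following is a minimal sketch of how they might chain together in a single dialog turn. It is not the authors' implementation: the class names (`DialogState`, `SystemAction`), the function names, the toy document `TOY_DOC`, and the keyword and template rules are all hypothetical stand-ins for the learned models the paper evaluates.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DialogState:
    """Hypothetical state: the document node the user appears to be asking about."""
    focus: Optional[str] = None
    history: List[str] = field(default_factory=list)


@dataclass
class SystemAction:
    """Hypothetical action: ask a clarifying question or give an answer."""
    act: str      # "clarify" or "answer"
    content: str  # options to ask about, or grounding text to answer with


# Toy document (invented): the answer depends on a condition the user may omit.
TOY_DOC = {
    "visa fee": {
        "tourist": "The tourist visa fee is 100.",
        "student": "The student visa fee is 50.",
    },
}


def track_state(state: DialogState, utterance: str) -> DialogState:
    """Task 1 stand-in: keyword matching in place of a learned state tracker."""
    text = utterance.lower()
    focus = state.focus
    for topic, branches in TOY_DOC.items():
        if topic in text:
            focus = topic
        if focus is not None and focus.startswith(topic):
            for branch in branches:
                if branch in text:
                    focus = f"{topic}/{branch}"
    return DialogState(focus=focus, history=state.history + [utterance])


def choose_action(state: DialogState) -> SystemAction:
    """Task 2 stand-in: ask when the need is underspecified, otherwise answer."""
    if state.focus is None:
        return SystemAction("clarify", "which topic you are asking about")
    topic, _, branch = state.focus.partition("/")
    if not branch:
        return SystemAction("clarify", " or ".join(sorted(TOY_DOC[topic])))
    return SystemAction("answer", TOY_DOC[topic][branch])


def generate_response(action: SystemAction) -> str:
    """Task 3 stand-in: a fixed template in place of a learned generator."""
    if action.act == "clarify":
        return f"Could you clarify: {action.content}?"
    return action.content


def respond(state: DialogState, utterance: str) -> Tuple[DialogState, str]:
    """Run one turn: state tracking, then policy, then response generation."""
    state = track_state(state, utterance)
    return state, generate_response(choose_action(state))


state = DialogState()
state, first_reply = respond(state, "How much is the visa fee?")
print(first_reply)   # Could you clarify: student or tourist?
state, second_reply = respond(state, "I am a student.")
print(second_reply)  # The student visa fee is 50.
```

In this sketch the first user turn is underspecified, so the policy chooses a clarifying question; once the second turn supplies the missing condition, the policy selects the grounding text and the generator returns it.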