Building document-grounded dialogue systems has received growing interest, as documents convey a wealth of human knowledge and are ubiquitous in enterprises. A central and challenging research problem is how to comprehend and retrieve information from such documents. Previous work ignores the visual properties of documents and treats them as plain text, resulting in incomplete modality modeling. In this paper, we propose LIE, a Layout-aware document-level Information Extraction dataset, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents (VRDs), so as to generate accurate responses in dialogue systems. LIE contains 62k annotations across three extraction tasks from 4,061 pages of product and official documents, making it, to the best of our knowledge, the largest VRD-based information extraction dataset. We also develop benchmark methods that extend token-based language models to take layout features into account, as humans do. Empirical results show that layout is critical for VRD-based extraction, and a system demonstration further verifies that the extracted knowledge helps locate the answers that users care about.
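To make the idea of extending a token-based language model with layout features concrete, the sketch below shows one common way to do this, in the spirit of LayoutLM-style 2D position embeddings: each token's bounding box on the page is embedded and added to its text embedding before the encoder. This is an illustrative sketch under assumed names and dimensions, not the authors' released benchmark code; `LayoutAwareEmbedding`, the coordinate range, and all sizes are assumptions.

```python
# Minimal sketch (assumed, not the paper's code): fuse token embeddings with
# 2D layout embeddings derived from each token's bounding box (x0, y0, x1, y1).
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_coord=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)  # standard token embedding
        self.x_emb = nn.Embedding(max_coord, hidden)      # embedding for x coordinates
        self.y_emb = nn.Embedding(max_coord, hidden)      # embedding for y coordinates
        # Coordinates are assumed normalized to integers in [0, max_coord).

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq_len); bboxes: (batch, seq_len, 4) as (x0, y0, x1, y1)
        text = self.word_emb(token_ids)
        layout = (self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
                  + self.x_emb(bboxes[..., 2]) + self.y_emb(bboxes[..., 3]))
        return text + layout  # layout-aware token representation fed to the encoder

# Usage: the fused embeddings replace the plain word embeddings as input to a
# Transformer encoder, so attention can exploit both text and page geometry.
emb = LayoutAwareEmbedding()
ids = torch.randint(0, 30522, (2, 16))
boxes = torch.randint(0, 1024, (2, 16, 4))
out = emb(ids, boxes)  # (2, 16, 768)
```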