DirectQuote: A Dataset for Direct Quotation Extraction and Attribution in News Articles

Quotation extraction and attribution are challenging tasks that aim to identify the text spans containing quotations and to attribute each quotation to its original speaker. Applied to news data, these tasks are closely tied to fact-checking, media monitoring, and news tracking. Among the different types of quotations, direct quotations are the most traceable and informative, which makes them especially important. This paper introduces DirectQuote, a corpus of 19,760 paragraphs and 10,279 direct quotations manually annotated from online news media. To the best of our knowledge, it is the largest and most complete corpus focused on direct quotations in news text. Every speaker in the annotation can be linked to a specific named entity on Wikidata, which benefits a range of downstream tasks. In addition, we propose, for the first time, several sequence labeling models as baseline methods that extract and attribute quotations simultaneously in an end-to-end manner.
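To make the idea of joint extraction and attribution through sequence labeling more concrete, here is a minimal sketch of how a single tag sequence could encode both quotation spans and their speakers. The tag names (`SPEAKER`, `QUOTE_L`, `QUOTE_R`), the nearest-speaker decoding rule, and the example sentence are all assumptions made for illustration; the abstract does not specify the labeling scheme used in DirectQuote, so this should not be read as the authors' method.

```python
from typing import List, Optional, Tuple

Span = Tuple[str, int, int]  # (label, start, end), end exclusive


def decode_spans(tags: List[str]) -> List[Span]:
    """Collect contiguous B-/I- runs into (label, start, end) spans."""
    spans: List[Span] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if label is not None and tag == "I-" + label:
            continue  # the current span keeps going
        if label is not None:
            spans.append((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
    return spans


def attribute(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each quotation with a speaker using the hypothetical tag scheme.

    QUOTE_L: the speaker is the nearest SPEAKER span to the left.
    QUOTE_R: the speaker is the nearest SPEAKER span to the right.
    """
    spans = decode_spans(tags)
    speakers = [(s, e) for lab, s, e in spans if lab == "SPEAKER"]
    pairs: List[Tuple[str, Optional[str]]] = []
    for lab, s, e in spans:
        if lab == "QUOTE_L":
            cands = [sp for sp in speakers if sp[1] <= s]
            best = max(cands, key=lambda sp: sp[1], default=None)
        elif lab == "QUOTE_R":
            cands = [sp for sp in speakers if sp[0] >= e]
            best = min(cands, key=lambda sp: sp[0], default=None)
        else:
            continue
        quote = " ".join(tokens[s:e])
        speaker = " ".join(tokens[best[0]:best[1]]) if best else None
        pairs.append((quote, speaker))
    return pairs


# Invented example; in practice the tags would come from a trained tagger.
tokens = ["Alice", "Smith", "said", "we", "will", "win", ".",
          "it", "is", "over", ",", "Bob", "replied"]
tags = ["B-SPEAKER", "I-SPEAKER", "O", "B-QUOTE_L", "I-QUOTE_L", "I-QUOTE_L", "O",
        "B-QUOTE_R", "I-QUOTE_R", "I-QUOTE_R", "O", "B-SPEAKER", "O"]
pairs = attribute(tokens, tags)
```

Because the direction of the speaker is folded into the quotation tag, one pass of a token-level tagger yields both the quotation span and enough information to recover its speaker, which is one plausible reading of "end-to-end" here.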
