NLP models that compare or consolidate information across multiple documents often struggle to recognize substantial information redundancies across the texts. For example, in multi-document summarization it is crucial to identify salient information across texts and then generate a non-redundant summary, while facing repeated and usually differently-phrased salient content. To facilitate research on such challenges, the sentence-level task of \textit{sentence fusion} was proposed, yet previous datasets for this task were very limited in size and scope. In this paper, we revisit and substantially extend previous dataset creation efforts. With careful modifications, relabeling, and the use of complementary data sources, we were able to triple the size of a notable earlier dataset. Moreover, we show that our extended version provides more representative texts for multi-document tasks and a larger, more diverse training set, which substantially improves model training.