Most of the existing information extraction frameworks (Wadden et al., 2019; Veyseh et al., 2020) focus on sentence-level tasks and are hardly able to capture the consolidated information from a given document. In our endeavour to generate precise document-level information frames from lengthy textual records, we introduce the task of Information Aggregation or Argument Aggregation. More specifically, our aim is to filter out irrelevant and redundant argument mentions that were extracted at the sentence level and render a document-level information frame. The majority of existing works address the related tasks of document-level event argument extraction (Yang et al., 2018a; Zheng et al., 2019a) and salient entity identification (Jain et al., 2020) using supervised techniques. To remove the dependency on large amounts of labelled data, we explore the task of information aggregation using weakly supervised techniques. In particular, we present an extractive algorithm with multiple sieves which adopts active learning strategies to work efficiently in low-resource settings. For this task, we have annotated our own test dataset comprising 131 document information frames and have released the code and dataset to support further research in this new domain. To the best of our knowledge, we are the first to establish baseline results for this task in English. Our data and code are publicly available at https://github.com/DebanjanaKar/ArgFuse.