AnnIE: An Annotation Platform for Constructing Complete Open Information Extraction Benchmark

Open Information Extraction (OIE) is the task of extracting facts from sentences in the form of relations and their corresponding arguments in a schema-free manner. The intrinsic performance of OIE systems is difficult to measure due to the incompleteness of existing OIE benchmarks: the ground truth extractions do not group all acceptable surface realizations of the same fact that can be extracted from a sentence. To measure the performance of OIE systems more realistically, it is necessary to manually annotate complete facts (i.e., clusters of all acceptable surface realizations of the same fact) from input sentences. We propose AnnIE: an interactive annotation platform that facilitates such challenging annotation tasks and supports the creation of complete fact-oriented OIE evaluation benchmarks. AnnIE is modular and flexible in order to support different use case scenarios (i.e., benchmarks covering different types of facts). We use AnnIE to build two complete OIE benchmarks: one with verb-mediated facts and another with facts encompassing named entities. Finally, we evaluate several OIE systems on our complete benchmarks created with AnnIE. Our results suggest that existing incomplete benchmarks are overly lenient, and that OIE systems are not as robust as previously reported. We publicly release AnnIE under a non-restrictive license.
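To make the idea of a "complete fact" concrete, the following is a minimal sketch of how a system's extractions might be scored against clusters of acceptable surface realizations. It is not the AnnIE implementation: the `Triple` and `FactCluster` types, the `score` function, the exact-match rule, and the example sentence annotations are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical types: a triple is (subject, relation, object); a fact cluster
# holds every acceptable surface realization of one fact.
Triple = Tuple[str, str, str]
FactCluster = FrozenSet[Triple]


def score(extractions: List[Triple], gold: List[FactCluster]) -> Tuple[float, float]:
    """Return (precision, recall) for one sentence.

    Assumed matching rule: an extraction is correct if it exactly matches any
    realization in some cluster; a fact counts as recalled if at least one of
    its realizations was extracted.
    """
    extracted = set(extractions)  # ignore duplicate extractions
    correct = sum(1 for t in extracted if any(t in c for c in gold))
    covered = sum(1 for c in gold if extracted & c)
    precision = correct / len(extracted) if extracted else 0.0
    recall = covered / len(gold) if gold else 0.0
    return precision, recall


# Illustrative annotation for the sentence:
# "Michael Jordan, who is a former basketball player, was born in Brooklyn."
gold = [
    frozenset({
        ("Michael Jordan", "was born in", "Brooklyn"),
        ("Michael Jordan", "was born", "in Brooklyn"),
    }),
    frozenset({
        ("Michael Jordan", "is", "a former basketball player"),
        ("Michael Jordan", "is", "former basketball player"),
    }),
]

system_output = [
    ("Michael Jordan", "was born", "in Brooklyn"),
    ("Michael Jordan", "was born in", "Brooklyn"),
    ("Jordan", "is", "player"),  # matches no realization in any cluster
]

precision, recall = score(system_output, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this assumed rule, two realizations of the same fact both count as correct extractions but cover only one gold fact, so recall is computed over facts rather than over individual triples. A benchmark that listed only one realization per fact would need lenient token-overlap matching to avoid penalizing the other; clustering all realizations allows exact matching instead.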


