科学文件中多层标题数字表的计量类识别 (Metric-Type Identification for Multi-Level Header Numerical Tables in Scientific Papers)

Numerical tables are widely used to present experimental results in scientific papers. For table understanding, a metric-type is essential to discriminate numbers in the tables. We introduce a new information extraction task, metric-type identification from multi-level header numerical tables, and provide a dataset extracted from scientific papers consisting of header tables, captions, and metric-types. We then propose two joint-learning neural classification and generation schemes featuring pointer-generator-based and BERT-based models. Our results show that the joint models can handle both in-header and out-of-header metric-type identification problems.

翻译：数字表格被广泛用于在科学论文中介绍实验结果。为了理解表格,对表格中的数字进行区分至关重要。我们引入了新的信息提取任务,即从多级页头数字表格中进行类型识别,并提供从科学论文中提取的数据集,包括页头表格、标题和计量类型。然后,我们提出两个联合学习神经分类和生成计划,以指针式和BERT型模型为主。我们的结果显示,联合模型既可以处理标题型,也可以处理标题型外的识别型号。

相关内容

信息抽取

关注 350

信息抽取（Information Extraction: IE）是把文本里包含的信息进行结构化处理，变成表格一样的组织形式。输入信息抽取系统的是原始文本，输出的是固定格式的信息点。信息点从各种各样的文档中被抽取出来，然后以统一的形式集成在一起。这就是信息抽取的主要任务。信息以统一的形式集成在一起的好处是方便检查和比较。信息抽取技术并不试图全面理解整篇文档，只是对文档中包含相关信息的部分进行分析。至于哪些信息是相关的，那将由系统设计时定下的领域范围而定。

AAAI2021 | 图神经网络的异质图结构学习，Heterogeneous Graph Structure Learning for Graph Neural Networks

专知会员服务

92+阅读 · 2021年1月20日

【干货书】Python程序员编程，810页pdf，Python® for Programmers

专知会员服务

62+阅读 · 2020年8月6日

【快讯】ECCV 2020论文出炉，1361篇上榜，你的paper中了吗？

专知会员服务

57+阅读 · 2020年7月3日

【AAAI 2019】双曲异构信息网络嵌入，Hyperbolic Heterogeneous Information Network Embedding

专知会员服务

60+阅读 · 2020年6月28日