A crucial component in curating a knowledge base (KB) for a scientific domain is information extraction from tables in the domain's published articles: tables carry important information (often numeric), which must be adequately extracted for a comprehensive machine understanding of an article. Existing table extractors assume prior knowledge of table structure and format, which may not be available for scientific tables. We study a specific and challenging table extraction problem: extracting compositions of materials (e.g., glasses, alloys). We first observe that materials science researchers organize similar compositions in a wide variety of table styles, necessitating an intelligent model for table understanding and composition extraction. Consequently, we define this novel task as a challenge for the ML community and create a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present DiSCoMaT, a strong baseline geared towards this specific task, which combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DiSCoMaT outperforms recent table processing architectures by significant margins.