Classifying the core textual components of a scientific paper (title, author, body text, etc.) is a critical first step in automated scientific document understanding. Previous work has shown that using elementary layout information, i.e., each token's 2D position on the page, leads to more accurate classification. We introduce new methods for incorporating VIsual LAyout structures (VILA), e.g., the grouping of page texts into text lines or text blocks, into language models to further improve performance. We show that the I-VILA approach, which simply adds special tokens denoting boundaries between layout structures into model inputs, can improve F1 scores by 1 to 4.5 points in token classification tasks. Moreover, we design a hierarchical model, H-VILA, that encodes these layout structures, and we record an efficiency boost of up to 70% without hurting prediction accuracy. The experiments are conducted on a newly curated evaluation suite, S2-VLUE, with a novel metric measuring VILA awareness and a new dataset covering 19 scientific disciplines with gold annotations. Pre-trained weights, benchmark datasets, and source code will be available at https://github.com/allenai/VILA.
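The I-VILA idea of marking layout boundaries in the model input can be illustrated with a minimal sketch. It assumes that each token already carries a group id (for example, the index of its text block or text line) and that the boundary marker is written as `[BLK]`; the marker name, the function name `inject_boundary_tokens`, and the example tokens are illustrative assumptions, not details given in the abstract.

```python
from typing import List, Tuple

BOUNDARY_TOKEN = "[BLK]"  # assumed marker name, not specified in the abstract


def inject_boundary_tokens(
    tokens: List[str],
    group_ids: List[int],
    boundary_token: str = BOUNDARY_TOKEN,
) -> Tuple[List[str], List[bool]]:
    """Insert boundary_token wherever consecutive tokens change layout group.

    Returns the augmented token sequence and a parallel mask that is True for
    original tokens and False for inserted markers, so per-token predictions
    can later be mapped back to the original tokens.
    """
    if len(tokens) != len(group_ids):
        raise ValueError("tokens and group_ids must have the same length")

    augmented: List[str] = []
    is_original: List[bool] = []
    previous_group = None
    for token, group in zip(tokens, group_ids):
        # A change of group id marks the start of a new text block or line.
        if previous_group is not None and group != previous_group:
            augmented.append(boundary_token)
            is_original.append(False)
        augmented.append(token)
        is_original.append(True)
        previous_group = group
    return augmented, is_original


# Hypothetical page: a title block, an author block, and a body block.
tokens = ["Deep", "Parsing", "Jane", "Doe", "We", "study", "layout"]
group_ids = [0, 0, 1, 1, 2, 2, 2]
augmented, is_original = inject_boundary_tokens(tokens, group_ids)
print(augmented)
# ['Deep', 'Parsing', '[BLK]', 'Jane', 'Doe', '[BLK]', 'We', 'study', 'layout']
```

In a real pipeline the marker would also have to be registered as a special token in the tokenizer's vocabulary, and the mask would be used to skip the inserted positions when computing the classification loss; both steps are omitted here.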