Information extraction (IE) from visually rich documents (VRDs) has recently achieved state-of-the-art (SOTA) performance thanks to the adaptation of Transformer-based language models, demonstrating the great potential of pre-training methods. In this paper, we present a new approach to improve the capability of language-model pre-training on VRDs. First, we introduce a new query-based IE model that employs the span extraction formulation instead of the commonly used sequence-labelling approach. Second, to further extend the span extraction formulation, we propose a new training task that focuses on modelling the relationships between semantic entities within a document. This task enables spans to be extracted recursively and can serve both as a pre-training objective and as an IE downstream task. Evaluation on various datasets of popular business documents (invoices, receipts) shows that our proposed method significantly improves the performance of existing models, while providing a mechanism to accumulate model knowledge from multiple downstream IE tasks.
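To make the contrast with sequence labelling concrete, the following is a minimal sketch of the query-based span extraction formulation: instead of assigning a tag to every token, the model scores each token as a candidate start or end of the answer span, conditioned on a query embedding. All names here (`extract_span`, the bilinear weight matrices `w_start`/`w_end`) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def extract_span(token_embs, query_emb, w_start, w_end):
    """Hypothetical query-conditioned span extractor.

    token_embs: (seq_len, d) contextual token embeddings
    query_emb:  (d,) embedding of the query (e.g. a field name)
    w_start, w_end: (d, d) bilinear score matrices (assumed, for illustration)
    Returns the (start, end) token indices of the best-scoring span.
    """
    # Score each token as a span start / end, conditioned on the query.
    start_logits = token_embs @ (w_start @ query_emb)
    end_logits = token_embs @ (w_end @ query_emb)

    # Pick the valid pair (start <= end) that maximizes the summed logits.
    n = len(start_logits)
    best_score, best_span = -np.inf, (0, 0)
    for s in range(n):
        for e in range(s, n):
            if start_logits[s] + end_logits[e] > best_score:
                best_score = start_logits[s] + end_logits[e]
                best_span = (s, e)
    return best_span
```

Because the output is a span rather than a per-token tag sequence, the same extractor can be applied recursively: an extracted span (a semantic entity) can itself be embedded and used as the query for the next extraction step, which is the mechanism the proposed relationship-modelling task builds on.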