We present DocFormer -- a multi-modal transformer-based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts, etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision, and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text with visual tokens and vice versa. DocFormer is evaluated on 4 different datasets, each with strong baselines. DocFormer achieves state-of-the-art results on all of them, at times outperforming models 4x its size (in number of parameters).
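Since the abstract does not spell out the exact formulation of the multi-modal self-attention layer, the following is a minimal illustrative sketch (in PyTorch) of the one mechanism it does state: text and visual token streams drawing their spatial signal from a single shared, learned embedding table. The module name SharedSpatialSelfAttention, the single-head design, and the additive spatial bias are assumptions made for illustration, not the published DocFormer layer.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedSpatialSelfAttention(nn.Module):
        # Hypothetical sketch, NOT the published DocFormer layer. It only
        # illustrates sharing one learned spatial embedding table between
        # the text and visual token streams.
        def __init__(self, d_model: int, num_positions: int):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            # One spatial table used by BOTH modalities (the sharing idea).
            self.spatial = nn.Embedding(num_positions, d_model)
            self.scale = d_model ** -0.5

        def attend(self, x: torch.Tensor, pos_ids: torch.Tensor) -> torch.Tensor:
            # Bias this stream's tokens with the shared spatial embedding,
            # then run standard scaled dot-product self-attention.
            h = x + self.spatial(pos_ids)
            q, k, v = self.q(h), self.k(h), self.v(h)
            attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            return attn @ v

        def forward(self, text, visual, text_pos, visual_pos):
            # Both calls index the same table, so a text token and a visual
            # token at the same location receive the same spatial vector,
            # which eases cross-modal correlation.
            return self.attend(text, text_pos), self.attend(visual, visual_pos)

    # Usage with toy shapes (batch of 2, 64-dim features):
    layer = SharedSpatialSelfAttention(d_model=64, num_positions=512)
    text = torch.randn(2, 10, 64)              # (batch, text tokens, features)
    visual = torch.randn(2, 16, 64)            # (batch, visual tokens, features)
    text_pos = torch.randint(0, 512, (2, 10))  # quantized spatial position ids
    visual_pos = torch.randint(0, 512, (2, 16))
    t_out, v_out = layer(text, visual, text_pos, visual_pos)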