Recent grid-based document representations such as BERTgrid encode the textual and layout information of a document simultaneously in a 2D feature map, so that state-of-the-art image segmentation and/or object detection models can be leveraged directly to extract key information from documents. However, such methods have not yet achieved performance comparable to state-of-the-art sequence- and graph-based methods such as LayoutLM and PICK. In this paper, we propose a new multi-modal backbone network that concatenates a BERTgrid to an intermediate layer of a CNN model, where the input of the CNN is a document image and the BERTgrid is a grid of word embeddings, to generate a more powerful grid-based document representation named ViBERTgrid. Unlike BERTgrid, the parameters of BERT and the CNN in our multi-modal backbone network are trained jointly. Our experimental results demonstrate that this joint training strategy significantly improves the representation ability of ViBERTgrid. Consequently, our ViBERTgrid-based key information extraction approach achieves state-of-the-art performance on real-world datasets.
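To make the fusion step concrete, the following is a minimal sketch, not the authors' implementation, assuming PyTorch with a torchvision ResNet-18 backbone; the class name `ViBERTgridSketch`, the channel counts, and the tensor shapes are illustrative assumptions. It shows only the channel-wise concatenation of a BERTgrid feature map with an intermediate CNN feature map; in the actual approach the grid is produced by a BERT encoder whose parameters receive gradients through this fusion, which is what enables joint training of BERT and the CNN.

```python
# Minimal sketch (assumptions: PyTorch, torchvision ResNet-18, a precomputed
# BERTgrid tensor standing in for the output of a BERT encoder).
import torch
import torch.nn as nn
import torchvision.models as models


class ViBERTgridSketch(nn.Module):
    """Hypothetical fusion of an intermediate CNN feature map with a BERTgrid."""

    def __init__(self, bert_dim: int = 768, fused_dim: int = 128):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Early CNN stages: document image -> intermediate feature map (stride 8).
        self.stem = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2,
        )
        cnn_dim = 128  # channels of the ResNet-18 layer2 output
        # 1x1 conv to mix the concatenated CNN + BERTgrid channels back down
        # to the channel count expected by the remaining CNN stages.
        self.fuse = nn.Conv2d(cnn_dim + bert_dim, fused_dim, kernel_size=1)
        self.rest = resnet.layer3  # later stages continue from the fused map

    def forward(self, image: torch.Tensor, bertgrid: torch.Tensor) -> torch.Tensor:
        # image:    (B, 3, H, W) document image
        # bertgrid: (B, bert_dim, H/8, W/8) word embeddings placed at word boxes
        feat = self.stem(image)                     # (B, 128, H/8, W/8)
        fused = torch.cat([feat, bertgrid], dim=1)  # channel-wise concatenation
        fused = self.fuse(fused)
        return self.rest(fused)


# Usage: gradients flow through both the CNN and, upstream, whatever BERT
# encoder produced `bertgrid`, so the two networks can be optimized jointly.
model = ViBERTgridSketch()
img = torch.randn(1, 3, 512, 512)
grid = torch.randn(1, 768, 64, 64)  # stand-in for a real BERTgrid
print(model(img, grid).shape)       # torch.Size([1, 256, 32, 32])
```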