In this paper, we present a new question-answering (QA) based key-value pair extraction approach, called KVPFormer, to robustly extract key-value relationships between entities from form-like document images. Specifically, KVPFormer first identifies key entities among all entities in an image with a Transformer encoder, then takes these key entities as \textbf{questions} and feeds them into a Transformer decoder to predict their corresponding \textbf{answers} (i.e., value entities) in parallel. To achieve higher answer prediction accuracy, we further propose a coarse-to-fine answer prediction approach, which first extracts multiple answer candidates for each identified question in the coarse stage and then selects the most likely one among these candidates in the fine stage. In this way, the learning difficulty of answer prediction is effectively reduced, which in turn improves prediction accuracy. Moreover, we introduce a spatial compatibility attention bias into the self-attention/cross-attention mechanism of \Ours{} to better model the spatial interactions between entities. With these new techniques, our proposed \Ours{} achieves state-of-the-art results on the FUNSD and XFUND datasets, outperforming the previous best-performing method by 7.2\% and 13.2\% in F1 score, respectively.
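For concreteness, one common way to realize such a bias (a sketch of the general form only; the symbol $\mathbf{B}$ and its parameterization are illustrative assumptions, not the exact definition used by \Ours{}) is to add a pairwise spatial term to the attention logits before the softmax:
\[
\mathrm{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \;=\; \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}} + \mathbf{B}\right)\mathbf{V},
\]
where $\mathbf{B} \in \mathbb{R}^{n \times n}$ collects compatibility scores $B_{ij}$ computed from the relative spatial layout (e.g., relative position of the bounding boxes) of entity pair $(i, j)$, so that spatially plausible key-value pairs receive larger attention weights.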