In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network, mainly to address the first two issues above. SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Finally, the outputs of these modules are processed by a global-local attentional answering module that produces an answer by iteratively splicing together tokens from both the OCR results and a general vocabulary, following M4C. Our proposed model outperforms the SoTA models on the TextVQA dataset and on two tasks of the ST-VQA dataset, among all models except the pre-training-based TAP. Demonstrating strong reasoning ability, it also won first place in the TextVQA Challenge 2020. We extensively test different OCR methods on several reasoning models and investigate the impact of gradually improved OCR performance on the TextVQA benchmark. With better OCR results, all models show dramatic improvements in VQA accuracy, but ours benefits the most, thanks to its strong textual-visual reasoning ability. To establish an upper bound for our method and provide a fair testing base for future work, we also provide human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not included in the original release. The code and ground-truth OCR annotations for the TextVQA dataset are available at https://github.com/ChenyuGAO-CS/SMA
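To make the answering step concrete, the following is a minimal sketch (not the authors' implementation) of the M4C-style iterative decoding referred to above, in which each step scores both a fixed answer vocabulary and the dynamic OCR tokens and selects from their union; all class, method and variable names here (e.g. PointerAugmentedDecoder, fused_feat, ocr_feat) are illustrative assumptions, and a plain GRU cell stands in for the full global-local attentional answering module.

```python
# Minimal sketch of pointer-augmented iterative answer decoding (M4C-style).
# Assumption: fused question/image features and OCR token features are already
# produced by an upstream reasoning module; names are illustrative only.
import torch
import torch.nn as nn


class PointerAugmentedDecoder(nn.Module):
    """Scores fixed-vocabulary tokens with a linear head and OCR tokens with a
    dynamic pointer head, then decodes greedily over their concatenation."""

    def __init__(self, hidden_dim: int, vocab_size: int, max_steps: int = 12):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)  # scores for the fixed vocabulary
        self.ocr_proj = nn.Linear(hidden_dim, hidden_dim)    # pointer scores for OCR tokens
        self.step_rnn = nn.GRUCell(hidden_dim, hidden_dim)   # stand-in for the answering module
        self.vocab_emb = nn.Embedding(vocab_size, hidden_dim)
        self.max_steps = max_steps
        self.vocab_size = vocab_size

    def forward(self, fused_feat: torch.Tensor, ocr_feat: torch.Tensor) -> torch.Tensor:
        # fused_feat: (B, H) summary feature from the reasoning module (assumed given)
        # ocr_feat:   (B, N_ocr, H) features of the recognised OCR tokens
        B, H = fused_feat.shape
        prev_emb = torch.zeros(B, H, device=fused_feat.device)  # placeholder <begin> embedding
        state = fused_feat
        answer_ids = []
        for _ in range(self.max_steps):
            state = self.step_rnn(prev_emb, state)
            vocab_scores = self.vocab_head(state)                                            # (B, V)
            ocr_scores = torch.bmm(ocr_feat, self.ocr_proj(state).unsqueeze(2)).squeeze(2)   # (B, N_ocr)
            scores = torch.cat([vocab_scores, ocr_scores], dim=1)  # union of both token sources
            idx = scores.argmax(dim=1)                             # greedy decoding for brevity
            answer_ids.append(idx)
            # Next-step input: vocabulary embedding if the token came from the fixed
            # vocabulary, otherwise the pointed-to OCR feature (copy mechanism).
            from_vocab = idx < self.vocab_size
            vocab_part = self.vocab_emb(idx.clamp(max=self.vocab_size - 1))
            ocr_idx = (idx - self.vocab_size).clamp(min=0)
            ocr_part = ocr_feat[torch.arange(B, device=idx.device), ocr_idx]
            prev_emb = torch.where(from_vocab.unsqueeze(1), vocab_part, ocr_part)
        return torch.stack(answer_ids, dim=1)  # (B, max_steps) indices into [vocab; OCR tokens]


# Usage with random features, purely for shape checking.
decoder = PointerAugmentedDecoder(hidden_dim=768, vocab_size=5000)
ids = decoder(torch.randn(2, 768), torch.randn(2, 30, 768))
print(ids.shape)  # torch.Size([2, 12])
```

The key design point the sketch illustrates is that OCR tokens are scored dynamically per image rather than drawn from a closed vocabulary, which is what allows the answer to copy scene text that was never seen during training.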