In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images to answer questions. Unlike plain text or visual objects, which can exist independently, scene text naturally links the text and visual modalities: it conveys linguistic semantics while simultaneously appearing as a visual object in the image. In contrast to conventional STVQA models, which treat the linguistic semantics and visual semantics of scene text as two separate features, we propose a "Locate Then Generate" (LTG) paradigm that explicitly unifies the two kinds of semantics, with the spatial bounding box serving as the bridge that connects them. Specifically, LTG first locates the region of an image that may contain the answer words using an answer location module (ALM), which consists of a region proposal network and a language refinement network; the outputs of the two networks can be transformed into each other through the one-to-one mapping defined by the scene text bounding boxes. Then, given the answer words selected by the ALM, LTG generates a readable answer sequence with an answer generation module (AGM) built on a pre-trained language model. Owing to this explicit alignment of visual and linguistic semantics, even without any scene-text-based pre-training tasks, LTG improves absolute accuracy by +6.06% on the TextVQA dataset and +6.92% on the ST-VQA dataset, compared with a non-pre-training baseline. We further demonstrate that LTG effectively unifies the visual and text modalities through the spatial bounding box connection, an aspect underappreciated in previous methods.
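To make the two-stage "locate then generate" flow concrete, the following is a minimal sketch of the pipeline described above. It only illustrates the control flow (ALM scores OCR tokens using visual and linguistic branches tied together by shared bounding boxes, then AGM assembles the answer); all module internals, feature dimensions, and the `AnswerLocationModule` / `AnswerGenerationModule` names are hypothetical placeholders rather than the authors' released implementation, and the AGM here is a trivial stand-in for the pre-trained language model used in the paper.

```python
# Minimal sketch of the "Locate Then Generate" (LTG) inference flow.
# Assumes PyTorch; all dimensions and module internals are illustrative.
import torch
import torch.nn as nn


class AnswerLocationModule(nn.Module):
    """ALM sketch: scores each OCR token as a candidate answer word.

    A region proposal branch scores scene-text regions from visual features,
    a language refinement branch rescores them from word embeddings, and the
    two branches are aligned one-to-one through the shared bounding boxes.
    """

    def __init__(self, vis_dim=256, txt_dim=300, hid=128):
        super().__init__()
        self.region_proposal = nn.Sequential(
            nn.Linear(vis_dim + 4, hid), nn.ReLU(), nn.Linear(hid, 1))
        self.language_refine = nn.Sequential(
            nn.Linear(txt_dim + 4, hid), nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, vis_feats, word_embs, boxes):
        # boxes (N, 4) are the bridge: the same bounding box indexes both the
        # visual region feature and the word embedding of a scene text token.
        region_score = self.region_proposal(torch.cat([vis_feats, boxes], dim=-1))
        refine_score = self.language_refine(torch.cat([word_embs, boxes], dim=-1))
        return (region_score + refine_score).squeeze(-1)  # (N,) candidate scores


class AnswerGenerationModule(nn.Module):
    """AGM stand-in: in the paper this is a pre-trained language model that
    rewrites the located words into a readable answer sequence; here we simply
    return the top-scoring words in their original order as a placeholder."""

    def forward(self, ocr_tokens, scores, top_k=2):
        idx = torch.topk(scores, k=min(top_k, len(ocr_tokens))).indices
        return " ".join(ocr_tokens[i] for i in sorted(idx.tolist()))


if __name__ == "__main__":
    ocr_tokens = ["coca", "cola", "store", "open"]
    vis_feats = torch.randn(4, 256)   # visual features of OCR regions
    word_embs = torch.randn(4, 300)   # word embeddings of OCR tokens
    boxes = torch.rand(4, 4)          # normalized boxes (x1, y1, x2, y2)

    alm, agm = AnswerLocationModule(), AnswerGenerationModule()
    scores = alm(vis_feats, word_embs, boxes)  # locate: which words may answer
    print(agm(ocr_tokens, scores))             # generate: assemble the answer
```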