Most TextVQA approaches integrate objects, scene text, and question words with a simple transformer encoder, which fails to capture the semantic relations between the different modalities. This paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA, which reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens, and question words. This is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image. We create a guided-attention module to capture the intra-modal interplay between language and vision as guidance for inter-modal interactions. To explicitly teach the relations between the two modalities, we propose and integrate two attention modules, namely a scene graph based semantic relation-aware attention and a positional relation-aware attention. We conduct extensive experiments on two benchmark datasets, Text-VQA and ST-VQA, and show that our SceneGATE method outperforms existing approaches owing to the scene graph and its attention modules.
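To make the guided-attention idea concrete, the sketch below is a minimal PyTorch illustration (not the authors' implementation): the language modality is first self-attended (intra-modal interplay) and then used as queries over the visual features (objects and OCR tokens) for inter-modal interaction. All module names, dimensions, and the residual/normalization choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class GuidedAttention(nn.Module):
    """Illustrative guided-attention block: language guides attention over vision."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Intra-modal self-attention on the guiding (language) modality.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modal cross-attention: language queries attend to vision keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, lang: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # lang: (B, L, D) question-word features; vis: (B, V, D) object/OCR features.
        guided, _ = self.self_attn(lang, lang, lang)
        lang = self.norm1(lang + guided)
        fused, _ = self.cross_attn(lang, vis, vis)
        return self.norm2(lang + fused)


if __name__ == "__main__":
    block = GuidedAttention()
    q = torch.randn(2, 20, 768)   # question-word embeddings (hypothetical shapes)
    v = torch.randn(2, 36, 768)   # object + OCR-token embeddings
    print(block(q, v).shape)      # torch.Size([2, 20, 768])
```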