Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively.
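To make the fusion idea concrete, below is a minimal PyTorch sketch of treating scene text as an additional modality: embeddings of detected OCR tokens are injected into the features of a pretrained encoder-decoder vision-language model through a small added cross-attention block. The module name `SceneTextFusion`, the tensor shapes, and the single-block design are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal, hypothetical sketch (not the paper's actual code) of fusing
# scene-text information, e.g. embedded OCR tokens, into the features of a
# pretrained encoder-decoder vision-language model via a small added module.
import torch
import torch.nn as nn


class SceneTextFusion(nn.Module):
    """Injects OCR-token embeddings into pretrained encoder features
    through cross-attention with a residual connection."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.ocr_proj = nn.Linear(dim, dim)  # project OCR embeddings into the model's feature space
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, enc_feats: torch.Tensor, ocr_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (B, N, D) features from the pretrained image-text encoder
        # ocr_feats: (B, M, D) embeddings of detected scene-text (OCR) tokens
        ocr = self.ocr_proj(ocr_feats)
        fused, _ = self.cross_attn(query=enc_feats, key=ocr, value=ocr)
        return self.norm(enc_feats + fused)  # residual fusion keeps the pretrained features intact


# Toy usage with random tensors; in a full model this block would sit between
# the pretrained encoder and the autoregressive decoder, so that answer or
# caption generation is conditioned on both visual and scene-text features.
fusion = SceneTextFusion(dim=768)
enc = torch.randn(2, 196, 768)   # e.g. ViT patch features
ocr = torch.randn(2, 20, 768)    # e.g. embedded OCR tokens
out = fusion(enc, ocr)           # -> (2, 196, 768)
```

Keeping the pretrained encoder and decoder unchanged and adding only such lightweight fusion modules is one plausible way to grant an existing architecture scene-text understanding without retraining it from scratch, which is the spirit of the approach described above.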