Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite the obvious resemblance between them, the two are treated independently, yielding task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on VQA and CAP by up to 3.49% and 0.7 CIDEr, respectively.
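To make the fusion idea concrete, below is a minimal sketch, assuming a PyTorch setting in which a small trainable module injects scene-text (OCR) token embeddings into the features of a frozen, pretrained encoder-decoder via cross-attention. The module name, dimensions, and cross-attention design are illustrative assumptions, not the paper's designated modules.

```python
# Illustrative sketch only: scene text as an extra modality fused into
# frozen visual features before the pretrained decoder. Names and sizes
# are assumptions for demonstration, not the authors' implementation.
import torch
import torch.nn as nn


class SceneTextFusion(nn.Module):
    """Fuses OCR-token embeddings into visual-token features with cross-attention."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, ocr_tokens: torch.Tensor) -> torch.Tensor:
        # Visual tokens attend to scene-text tokens; the residual connection
        # preserves the pretrained representation when OCR adds little signal.
        fused, _ = self.cross_attn(query=visual_tokens, key=ocr_tokens, value=ocr_tokens)
        return self.norm(visual_tokens + fused)


if __name__ == "__main__":
    batch, n_vis, n_ocr, dim = 2, 196, 32, 768
    visual_tokens = torch.randn(batch, n_vis, dim)    # from a frozen image encoder
    ocr_tokens = torch.randn(batch, n_ocr, dim)       # embedded OCR words + positions
    fusion = SceneTextFusion(dim)
    fused_tokens = fusion(visual_tokens, ocr_tokens)  # fed to the pretrained decoder
    print(fused_tokens.shape)                         # torch.Size([2, 196, 768])
```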