The open-ended question answering task of Text-VQA often requires reading and reasoning about rarely seen or completely unseen scene-text content in an image. We address this zero-shot nature of the problem by proposing the generalized use of external knowledge to augment our understanding of the scene text. We design a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision-language understanding tasks. Through empirical evidence and qualitative results, we demonstrate how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state of the art on three publicly available datasets, under the constraints of similar upstream OCR systems and training data.