The open-ended question answering task of Text-VQA requires reading and reasoning about local, often previously unseen, scene-text content of an image to generate answers. In this work, we propose the generalized use of external knowledge to augment our understanding of such scene text. We design a framework to extract, validate, and reason with knowledge using a standard multimodal transformer for vision-language understanding tasks. Through empirical evidence and qualitative results, we demonstrate how external knowledge can highlight instance-only cues and thus help deal with training data bias, improve answer entity type correctness, and detect multiword named entities. We generate results comparable to the state of the art on two publicly available datasets, under the constraints of similar upstream OCR systems and training data.