Visual question answering (VQA) often requires an understanding of visual concepts and language semantics, which relies on external knowledge. Most existing methods exploit pre-trained language models and/or unstructured text, but the knowledge in these resources is often incomplete and noisy. Some methods prefer to use knowledge graphs (KGs), which contain intensive structured knowledge, but the research in this direction is still quite preliminary. In this paper, we propose LaKo, a knowledge-driven VQA method via Late Knowledge-to-text Injection. To effectively incorporate an external KG, we transfer triples into textual format and propose a late injection mechanism. Finally, we address VQA as a text generation task with an effective encoder-decoder paradigm. In the evaluation on the OKVQA dataset, our method achieves state-of-the-art results.
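To make the knowledge-to-text idea concrete, below is a minimal sketch (not the authors' code) of verbalizing KG triples into sentences and feeding them, together with the question and an image caption, to a text-to-text encoder-decoder that generates the answer. The T5 backbone, the triple verbalization template, and the prompt format are all assumptions for illustration; the paper's late injection happens at a later fusion stage rather than by plain input concatenation.

```python
# Hypothetical sketch of triple verbalization + seq2seq answer generation.
from transformers import T5Tokenizer, T5ForConditionalGeneration

def verbalize_triples(triples):
    """Turn (head, relation, tail) triples into plain-text sentences."""
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples)

def answer(question, caption, triples, model_name="t5-base"):
    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)
    # Approximate knowledge injection by concatenating verbalized triples
    # into the source sequence (the actual method injects them later).
    source = (f"question: {question} context: {caption} "
              f"knowledge: {verbalize_triples(triples)}")
    input_ids = tokenizer(source, return_tensors="pt", truncation=True).input_ids
    output_ids = model.generate(input_ids, max_new_tokens=8)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer(
    "What is the capital of the country shown on the flag?",
    "a red and white flag with a maple leaf",
    [("Canada", "capital", "Ottawa"), ("maple_leaf", "symbol_of", "Canada")],
))
```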