Visual question answering (VQA) often requires an understanding of visual concepts and language semantics, which relies on external knowledge. Most existing methods exploit pre-trained language models and/or unstructured text, but the knowledge in these resources is often incomplete and noisy. Some other methods prefer to use knowledge graphs (KGs), which often contain intensive structured knowledge, but the research is still quite preliminary. In this paper, we propose LaKo, a knowledge-driven VQA method via Late Knowledge-to-text Injection. To effectively incorporate an external KG, we transform triples into textual format and propose a late injection mechanism for knowledge fusion. Finally, we address VQA as a text generation task with an effective encoder-decoder paradigm, which achieves state-of-the-art results on the OKVQA dataset.
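As a rough illustration of the knowledge-to-text step described above, the following minimal sketch (not the authors' code) verbalizes KG triples into sentences and appends them to the textual input of a generative model. The triple rendering, field names, and prompt layout are illustrative assumptions; the paper's actual late injection mechanism inside the encoder-decoder is not reproduced here.

```python
# Illustrative sketch: verbalize KG triples into text for a generative VQA input.
# The prompt format and relation rendering are assumptions, not the LaKo implementation.

def verbalize_triple(head: str, relation: str, tail: str) -> str:
    """Render one (head, relation, tail) triple as a short natural-language sentence."""
    return f"{head} {relation.replace('_', ' ')} {tail}."

def build_input(question: str, image_caption: str, triples: list) -> str:
    """Concatenate question, visual context, and verbalized knowledge into one text input."""
    knowledge = " ".join(verbalize_triple(h, r, t) for h, r, t in triples)
    return f"question: {question} context: {image_caption} knowledge: {knowledge}"

if __name__ == "__main__":
    triples = [("banana", "has_color", "yellow"), ("banana", "is_a", "fruit")]
    print(build_input("What color is the fruit?", "a banana on a table", triples))
```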