The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a different question: can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge, and about how reasoning over implicit and explicit knowledge should be integrated. To address these challenges, we propose a novel model, the Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result (+6 points absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an end-to-end encoder-decoder architecture, jointly reasoning over both knowledge sources during answer generation. An additional benefit of explicit knowledge integration, shown in our analysis, is improved interpretability of model predictions.