Incorporating external knowledge into Visual Question Answering (VQA) has become a vital practical need. Existing methods mostly adopt pipeline approaches with separate components for knowledge matching and extraction, feature learning, etc. However, such pipeline approaches suffer when any component performs poorly, leading to error propagation and degraded overall performance. Furthermore, the majority of existing approaches ignore the answer bias issue -- in real-world applications, many answers may never have appeared during training (i.e., unseen answers). To bridge these gaps, in this paper we propose a Zero-shot VQA algorithm that uses knowledge graphs and a mask-based learning mechanism to better incorporate external knowledge, and we present new answer-based Zero-shot VQA splits for the F-VQA dataset. Experiments show that our method achieves state-of-the-art performance on Zero-shot VQA with unseen answers, while also substantially improving existing end-to-end models on the standard F-VQA task.