Visual question answering (VQA) is a challenging task that has attracted growing attention in computer vision and natural language processing. However, current VQA models suffer from language bias, which reduces their robustness and hinders the practical application of VQA. In this paper, we present the first comprehensive review and analysis of this problem. We classify existing methods into three categories: enhancing visual information; weakening language priors; and data enhancement and training strategies. For each category, we introduce, summarize, and analyze representative methods in turn. We also identify and classify the causes of language bias. We then introduce the datasets mainly used for evaluation and report the experimental results of existing methods. Finally, we discuss possible directions for future research in this field.