Visual question answering (VQA) is a task that combines techniques from computer vision and natural language processing: a model must answer a text-based question using the information contained in an accompanying image. In recent years, the scope of VQA research has expanded, with growing attention to questions that probe reasoning ability and to VQA over scientific diagrams. At the same time, a variety of multimodal feature fusion mechanisms have been proposed. This paper reviews and analyzes existing datasets, evaluation metrics, and models proposed for the VQA task.