The Visual Question Answering (VQA) task combines the challenges of visual and linguistic processing to answer basic `common sense' questions about given images. Given an image and a question in natural language, a VQA system attempts to find the correct answer using visual elements of the image and inferences drawn from the textual question. In this survey, we cover and discuss the recently released datasets in the VQA domain, which address various question formats and promote the robustness of machine-learning models. Next, we discuss new deep learning models that have shown promising results on the VQA datasets. Finally, we present and discuss some of the results we computed with the vanilla VQA model, the Stacked Attention Network, and the VQA Challenge 2017 winner model. We also provide a detailed analysis along with the challenges and future research directions.
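To make the joint image-question setup concrete, the following is a minimal sketch of a vanilla VQA baseline in PyTorch: a pooled CNN image feature and an LSTM question encoding are fused element-wise and fed to a classifier over a fixed answer vocabulary. All dimensions, layer choices, and names here are illustrative assumptions, not the exact configuration of any model evaluated in this survey.

```python
import torch
import torch.nn as nn

class VanillaVQA(nn.Module):
    """Illustrative VQA baseline: CNN image features fused with an LSTM
    question encoding, then classified over a fixed answer vocabulary.
    Dimensions are placeholder assumptions, not values from the paper."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024,
                 img_feat_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_feat_dim) pooled features, e.g. from a pretrained CNN
        # question_tokens: (B, T) integer word indices
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                  # final hidden state: (B, hidden_dim)
        v = torch.relu(self.img_proj(img_feats))   # project image features to same space
        fused = q * v                              # element-wise fusion of the two modalities
        return self.classifier(fused)              # logits over candidate answers

# Usage with random placeholder inputs
model = VanillaVQA()
logits = model(torch.randn(4, 2048), torch.randint(1, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 1000])
```

Attention-based models such as the Stacked Attention Network replace the single pooled image vector with spatial feature maps and compute question-guided attention over image regions, but the overall encode-fuse-classify structure remains the same.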