In this work, we propose a deep neural architecture whose attention mechanism combines region-based image features, the natural language question, and semantic knowledge extracted from the regions of an image to produce open-ended answers in a visual question answering (VQA) task. Combining region-based visual features with region-based textual information about the image helps a model answer questions more accurately, and potentially with less training data. We evaluate the proposed architecture on a VQA task against a strong baseline and show that our method achieves excellent results on this task.
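To make the idea of attending over regions with both visual and textual information concrete, the following is a minimal sketch and not the architecture evaluated in this work. It assumes that each region already has a visual feature vector and an embedding of its textual description, that the question has been embedded into the same dimension, that the two region-level modalities are fused by element-wise addition, and that relevance is scored with a scaled dot product. The function name `region_attention` and all inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_attention(visual, textual, question):
    """Attend over image regions using the question as the query.

    visual[i]  : visual feature vector of region i (hypothetical, length d)
    textual[i] : embedding of the text describing region i (hypothetical, length d)
    question   : question embedding (hypothetical, length d)
    Returns (weights, context).
    """
    d = len(question)
    # Assumption: fuse the two region-level modalities by element-wise addition.
    fused = [[v + t for v, t in zip(vi, ti)] for vi, ti in zip(visual, textual)]
    # Assumption: scaled dot-product relevance of each fused region to the question.
    scores = [sum(f * q for f, q in zip(fi, question)) / math.sqrt(d) for fi in fused]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the fused region vectors.
    context = [sum(w * fi[k] for w, fi in zip(weights, fused)) for k in range(d)]
    return weights, context


# Toy inputs: three regions, two-dimensional embeddings (illustrative values only).
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
textual = [[0.5, 0.0], [0.0, 0.2], [0.1, 0.1]]
question = [1.0, 0.0]
weights, context = region_attention(visual, textual, question)
```

In this toy setup the first region aligns most closely with the question, so it receives the largest attention weight. In a full model, the resulting context vector would typically be passed, together with the question embedding, to an answer prediction layer.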