In this paper we aim to answer questions about images, given a training set of question-answer pairs for a number of images. Several methods address this problem with image-based attention, which focuses on a specific part of the image while answering the question. Humans do the same when solving this problem. However, the regions that previous systems attend to are not correlated with the regions that humans attend to, and this limits accuracy. We propose to solve this problem with an exemplar-based method: we obtain one or more supporting and opposing exemplars and use them to derive a differential attention region. This differential attention is closer to human attention than that of other image-based attention methods, and it also improves accuracy when answering questions. The method is evaluated on challenging benchmark datasets. We perform better than other image-based attention methods and are competitive with other state-of-the-art methods that attend to both the image and the question.
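To make the idea of supporting and opposing exemplars concrete, the following is a minimal sketch, not the implementation evaluated here. It assumes that attention is a softmax over region-question scores and that the exemplars enter through a triplet-style hinge loss on the attention maps. The function names (`attention`, `differential_loss`), the margin value, and the random features are all hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention(regions, question):
    """Soft attention over image regions.

    regions:  (R, D) array of region features (hypothetical).
    question: (D,) question embedding (hypothetical).
    Returns an (R,) attention distribution.
    """
    return softmax(regions @ question)


def differential_loss(target, supporting, opposing, margin=0.2):
    """Triplet-style hinge loss (an assumption, not taken from the text).

    Encourages the target attention map to lie closer to the supporting
    exemplar's map than to the opposing exemplar's map by at least `margin`.
    """
    d_pos = float(np.sum((target - supporting) ** 2))
    d_neg = float(np.sum((target - opposing) ** 2))
    return max(0.0, margin + d_pos - d_neg)


# Toy usage with random features standing in for real image regions.
rng = np.random.default_rng(0)
question = rng.normal(size=8)
target_img, support_img, oppose_img = (rng.normal(size=(5, 8)) for _ in range(3))

a_target = attention(target_img, question)
a_support = attention(support_img, question)
a_oppose = attention(oppose_img, question)
loss = differential_loss(a_target, a_support, a_oppose)
```

In this sketch the loss is zero once the target attention is sufficiently closer to the supporting exemplar than to the opposing one, so only violating triplets would contribute a training signal.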