While a lot of work has been done on developing models for Visual Question Answering, the ability of these models to relate the question to the image features remains underexplored. We present an empirical study of different feature extraction methods paired with different loss functions. We propose a new dataset for Visual Question Answering with multiple image inputs and a single ground truth, and benchmark our results on it. Our final model, which uses ResNet + R-CNN image features and BERT embeddings and is inspired by the Stacked Attention Network, achieves 39% word accuracy and 99% image accuracy on the CLEVER+TinyImagenet dataset.
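The abstract names a model that fuses ResNet/R-CNN region features with BERT question embeddings via stacked attention. Below is a minimal PyTorch sketch of such a two-hop stacked attention fusion; the dimensions (768-d BERT question vector, 2048-d region features), the two-hop configuration, and the single classifier head are illustrative assumptions, not the paper's exact architecture or its multi-image scoring head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedAttention(nn.Module):
    """One attention hop over image regions, conditioned on a query vector."""
    def __init__(self, q_dim, v_dim, hidden_dim):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden_dim)
        self.v_proj = nn.Linear(v_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, q, v):
        # q: (B, q_dim) question query; v: (B, R, v_dim) region features.
        h = torch.tanh(self.v_proj(v) + self.q_proj(q).unsqueeze(1))  # (B, R, H)
        attn = F.softmax(self.score(h).squeeze(-1), dim=1)            # (B, R)
        weighted = (attn.unsqueeze(-1) * v).sum(dim=1)                # (B, v_dim)
        return weighted, attn

class SANVQA(nn.Module):
    """Two stacked attention hops; the query is refined after each hop."""
    def __init__(self, q_dim=768, v_dim=2048, hidden_dim=512, n_answers=1000):
        super().__init__()
        self.v_to_q = nn.Linear(v_dim, q_dim)  # map visual context into query space
        self.att1 = StackedAttention(q_dim, v_dim, hidden_dim)
        self.att2 = StackedAttention(q_dim, v_dim, hidden_dim)
        self.classifier = nn.Linear(q_dim, n_answers)

    def forward(self, q_emb, img_feats):
        # q_emb: (B, 768), e.g. the BERT [CLS] embedding of the question.
        # img_feats: (B, R, 2048) region features from ResNet / R-CNN.
        ctx1, _ = self.att1(q_emb, img_feats)
        u1 = q_emb + self.v_to_q(ctx1)   # first refinement of the query
        ctx2, _ = self.att2(u1, img_feats)
        u2 = u1 + self.v_to_q(ctx2)      # second refinement
        return self.classifier(u2)       # answer logits

model = SANVQA()
logits = model(torch.randn(4, 768), torch.randn(4, 36, 2048))
print(logits.shape)  # torch.Size([4, 1000])
```

Stacking two hops lets the second attention pass re-weight regions using a query already informed by the first pass, which is the core idea of the Stacked Attention Network the abstract cites as inspiration.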