Current work on Visual Question Answering (VQA) explores deterministic approaches conditioned on various types of image and question features. We posit that, in addition to image–question pairs, other modalities are useful for teaching machines to carry out question answering. Hence, in this paper, we propose latent variable models for VQA in which extra information (e.g., captions and answer categories) is incorporated as latent variables that are observed during training and in turn benefit question-answering performance at test time. Experiments on the VQA v2.0 benchmark dataset demonstrate the effectiveness of our proposed models: they improve over strong baselines, especially those that do not rely on extensive language–vision pre-training.