We propose the inverse problem of visual question answering (iVQA), and explore its suitability as a benchmark for visuo-linguistic understanding. The iVQA task is to generate a question that corresponds to a given image and answer pair. Since the answers are less informative than the questions, and the questions have less learnable bias, an iVQA model must understand the image better than a VQA model in order to succeed. We pose question generation as a multi-modal dynamic inference process and propose an iVQA model that can gradually adjust its focus of attention, guided by both the partially generated question and the answer. For evaluation, apart from existing linguistic metrics, we propose a new ranking metric. This metric ranks the ground-truth question against a list of distractors, which allows the drawbacks of different algorithms and sources of error to be studied. Experimental results show that our model can generate diverse, grammatically correct, and content-correlated questions that match the given answer.
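As a rough illustration of the dynamic inference process described above, the sketch below shows a PyTorch decoder that re-attends over image regions at every generation step, with an attention query conditioned on both the decoder state (the partially generated question) and an answer encoding. All module names, dimensions, and the fusion scheme here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerGuidedDecoder(nn.Module):
    """Hypothetical question decoder with answer-guided dynamic attention."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 img_feat_dim=2048):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The LSTM state summarizes the partially generated question.
        self.lstm = nn.LSTMCell(embed_dim + img_feat_dim, hidden_dim)
        # Attention scores each image region conditioned on the decoder
        # state and an encoding of the answer.
        self.att_img = nn.Linear(img_feat_dim, hidden_dim)
        self.att_query = nn.Linear(hidden_dim * 2, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.ans_enc = nn.Linear(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, img_feats, h, ans_vec):
        # img_feats: (B, R, D); h, ans_vec: (B, H)
        query = self.att_query(torch.cat([h, ans_vec], dim=-1))
        scores = self.att_score(torch.tanh(
            self.att_img(img_feats) + query.unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)
        return (alpha * img_feats).sum(dim=1)               # (B, D)

    def forward(self, img_feats, answer_emb, question_tokens):
        # answer_emb: (B, embed_dim) pooled embedding of the answer words.
        B, T = question_tokens.shape
        h = img_feats.new_zeros(B, self.hidden_dim)
        c = img_feats.new_zeros(B, self.hidden_dim)
        ans_vec = torch.tanh(self.ans_enc(answer_emb))
        logits = []
        for t in range(T):
            w = self.embed(question_tokens[:, t])
            # Re-attend at every step: the focus of attention shifts
            # as the partial question grows.
            ctx = self.attend(img_feats, h, ans_vec)
            h, c = self.lstm(torch.cat([w, ctx], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, T, vocab)
```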
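The ranking metric can be sketched in a few lines, assuming the model exposes a scalar score (e.g., a log-likelihood) for a candidate question given the image and answer. The scoring function and distractor set here are assumptions for illustration.

```python
from typing import Callable, List, Sequence

def question_rank(score_fn: Callable[[str], float],
                  gt_question: str,
                  distractors: Sequence[str]) -> int:
    """1-based rank of the ground-truth question among distractors,
    where a higher score means a better match to the (image, answer) pair."""
    gt_score = score_fn(gt_question)
    # Rank = 1 + number of distractors the model prefers over the ground truth.
    return 1 + sum(score_fn(q) > gt_score for q in distractors)

def recall_at_k(ranks: List[int], k: int) -> float:
    """Fraction of examples whose ground-truth question ranks in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)
```

Because each distractor can be chosen to probe a specific failure mode (e.g., questions about the wrong object, or questions whose answer type mismatches), the per-distractor comparisons expose where a given algorithm's errors come from.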