Many modeling and training paradigms have been tested on multi-modal reasoning tasks such as visual question answering (VQA). Previous models propose different representations for the vision and language inputs, but which ones perform best while remaining sample- and computationally efficient? Based on our experiments, we find that representing text as probabilistic programs and images as object-level scene graphs best satisfies these desiderata. We extend existing models to leverage these soft programs and scene graphs, training on question-answer pairs in an end-to-end manner. Empirical results demonstrate that this differentiable end-to-end program executor maintains state-of-the-art accuracy while being sample and computationally efficient.
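To make the idea of a differentiable program executor concrete, the sketch below shows, under assumed names and structure (not the paper's actual implementation), how a soft program such as Count(Filter(red, Scene)) might be executed over object-level scene-graph features so that the answer remains differentiable and trainable from question-answer pairs alone.

```python
# Hypothetical minimal sketch of soft program execution over an object-level
# scene graph. All module names, shapes, and the toy program are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def soft_filter(obj_features, attention, concept_embedding):
    """Reweight the soft attention over objects by similarity to a concept (e.g. 'red')."""
    scores = torch.sigmoid(obj_features @ concept_embedding)  # (num_objects,)
    return attention * scores

def soft_count(attention):
    """Differentiable count: the sum of soft attention weights over objects."""
    return attention.sum()

# Toy scene graph: 4 detected objects, each an 8-dim feature vector.
obj_features = torch.randn(4, 8)
concept_red = torch.randn(8, requires_grad=True)  # learned concept embedding

# Execute the soft program Count(Filter(red, Scene)).
attention = torch.ones(4)                  # start by attending to every object
attention = soft_filter(obj_features, attention, concept_red)
answer = soft_count(attention)             # differentiable scalar prediction

# End-to-end training signal: a loss against the ground-truth answer backpropagates
# through the executor into the concept embedding.
loss = F.mse_loss(answer, torch.tensor(2.0))
loss.backward()
```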