Visual Question Answering (VQA) is a complex task requiring large datasets and expensive training. Neural Module Networks (NMN) first translate the question into a reasoning path, then follow that path to analyze the image and provide an answer. We propose an NMN method that relies on predefined cross-modal embeddings to ``warm start'' learning on the GQA dataset, then focus on Curriculum Learning (CL) as a way to improve training and make better use of the data. We define several CL methods based on different difficulty criteria. We show that, with an appropriate choice of CL method, the cost of training and the amount of training data can be greatly reduced, with limited impact on the final VQA accuracy. Furthermore, we introduce intermediate losses during training and find that this allows the CL strategy to be simplified.
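As a rough illustration of the curriculum idea described in the abstract, the sketch below orders training samples by a difficulty score and builds progressively larger training pools. The abstract does not specify the difficulty criteria or the schedule, so everything here is an assumption made for illustration: the helper name `curriculum_stages`, the use of the number of reasoning steps as a difficulty proxy, and the cumulative, equal-sized stages.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def curriculum_stages(
    samples: Sequence[T],
    difficulty: Callable[[T], float],
    num_stages: int,
) -> List[List[T]]:
    """Return one training pool per stage, from easiest to full dataset.

    Hypothetical helper: each stage keeps all earlier (easier) samples and
    adds the next slice of harder ones.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Sort from easiest to hardest according to the chosen criterion.
    ordered = sorted(samples, key=difficulty)
    stages: List[List[T]] = []
    for stage in range(1, num_stages + 1):
        # Cumulative cutoff: stage k sees the easiest k/num_stages fraction.
        cutoff = max(1, (len(ordered) * stage) // num_stages)
        stages.append(ordered[:cutoff])
    return stages


# Toy data (invented): (question id, assumed number of reasoning steps).
toy = [("q_a", 3), ("q_b", 1), ("q_c", 5), ("q_d", 2)]
pools = curriculum_stages(toy, difficulty=lambda s: s[1], num_stages=2)
for i, pool in enumerate(pools, start=1):
    print(f"stage {i}: {[q for q, _ in pool]}")
```

In a real training loop, each pool would be used for one or more epochs before moving to the next. Other difficulty criteria can be swapped in by changing only the `difficulty` function.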