Constructing a large-scale labeled dataset in the real world, especially for high-level tasks (e.g., Visual Question Answering), can be expensive and time-consuming. Moreover, with ever-growing amounts of data and architecture complexity, Active Learning has become an important aspect of computer vision research. In this work, we address Active Learning in the multi-modal setting of Visual Question Answering (VQA). In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition that uses ad hoc single-modal branches for each input to leverage its information. Our mutual-information-based sample acquisition strategy, Single-Modal Entropic Measure (SMEM), combined with our self-distillation technique, enables the acquisition function to exploit all present modalities and find the most informative samples. Our idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks. We validate our findings on several VQA datasets, achieving state-of-the-art performance against existing Active Learning baselines.
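To make the acquisition idea concrete, below is a minimal sketch of scoring unlabeled (image, question) pairs by the predictive entropy of single-modal branches. The branch models, the simple sum of per-modality entropies, and all names here are illustrative assumptions for exposition, not the paper's actual SMEM formulation or self-distillation procedure.

```python
# Hypothetical sketch: entropy-based acquisition over single-modal branches.
# All names and the entropy combination rule are assumptions, not SMEM itself.
import torch
import torch.nn.functional as F

def predictive_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the softmax distribution, per sample."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

@torch.no_grad()
def acquisition_scores(image_branch, question_branch, images, questions):
    """Score unlabeled (image, question) pairs by combining the predictive
    entropies of ad hoc single-modal branches; higher = more informative."""
    h_img = predictive_entropy(image_branch(images))      # image-only branch
    h_q = predictive_entropy(question_branch(questions))  # question-only branch
    return h_img + h_q  # simple sum; a mutual-information weighting may differ

# Usage sketch: label the top-k highest-scoring unlabeled samples.
# scores = acquisition_scores(img_net, q_net, unlabeled_images, unlabeled_questions)
# query_idx = scores.topk(k=64).indices
```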