Making every modality in multi-modal data contribute is vital to learning a versatile multi-modal model. Existing methods, however, are often dominated by one or a few modalities during training, which results in sub-optimal performance. In this paper, we refer to this problem as modality bias and study it systematically and comprehensively in the context of multi-modal classification. Through several empirical analyses, we find that one modality affects the model prediction more than the others simply because it has a spurious correlation with the instance labels. To facilitate evaluation of the modality bias problem, we construct two datasets, one for colored digit recognition and one for video action recognition, in line with the Out-of-Distribution (OoD) protocol. Together with existing benchmarks for the visual question answering task, these datasets let us empirically demonstrate the performance degradation of existing methods under OoD evaluation, which serves as evidence of modality bias learning. To overcome this problem, we further propose a plug-and-play loss function method in which the feature space for each label is adaptively learned according to the training set statistics. We then apply this method to eight baselines in total to test its effectiveness. Across four datasets covering the above three tasks, our method yields remarkable performance improvements over the baselines, demonstrating its superiority in reducing modality bias.
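The effect of a spurious correlation between one modality and the labels can be illustrated with a toy sketch. It is not the dataset construction used in this paper: the color ids, the `bias_prob` parameter, and the function names are hypothetical and chosen only for illustration. Color is tied to the digit label in the training split and made independent of it in the OoD test split, so a predictor that relies on color alone looks strong in training and collapses at test time.

```python
import random

NUM_CLASSES = 10  # ten digit labels and, by assumption, ten color ids


def assign_colors(labels, bias_prob, seed=0):
    """Return one color id per label.

    With probability bias_prob the color id equals the label (spurious
    correlation); otherwise the color id is drawn uniformly at random.
    """
    rng = random.Random(seed)
    colors = []
    for y in labels:
        if rng.random() < bias_prob:
            colors.append(y)  # color is tied to the label
        else:
            colors.append(rng.randrange(NUM_CLASSES))  # color is uninformative
    return colors


def color_only_accuracy(labels, colors):
    """Accuracy of a shortcut predictor that ignores shape and outputs the color id."""
    hits = sum(1 for y, c in zip(labels, colors) if y == c)
    return hits / len(labels)


label_rng = random.Random(123)
train_labels = [label_rng.randrange(NUM_CLASSES) for _ in range(5000)]
test_labels = [label_rng.randrange(NUM_CLASSES) for _ in range(5000)]

# Training split: color strongly correlated with the label.
train_colors = assign_colors(train_labels, bias_prob=0.9, seed=1)
# OoD test split: the correlation is removed.
test_colors = assign_colors(test_labels, bias_prob=0.0, seed=2)

train_acc = color_only_accuracy(train_labels, train_colors)
test_acc = color_only_accuracy(test_labels, test_colors)
print(f"color-only shortcut: train={train_acc:.3f}, OoD test={test_acc:.3f}")
```

A model that minimizes training loss on such data has little incentive to use the shape modality, which is the kind of behavior the OoD evaluation is meant to expose.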