We hypothesize that, due to the greedy nature of learning in multi-modal deep neural networks, these models tend to rely on just one modality while under-fitting the other modalities. Such behavior is counter-intuitive and hurts the models' generalization, as we observe empirically. To estimate the model's dependence on each modality, we compute the gain in accuracy when the model has access to it in addition to another modality. We refer to this gain as the conditional utilization rate. In our experiments, we consistently observe an imbalance in conditional utilization rates between modalities across multiple tasks and architectures. Since the conditional utilization rate cannot be computed efficiently during training, we introduce a proxy for it based on the pace at which the model learns from each modality, which we refer to as the conditional learning speed. We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning. The proposed algorithm improves the model's generalization on three datasets: Colored MNIST, Princeton ModelNet40, and NVIDIA Dynamic Hand Gesture.
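As a minimal formalization of the definition stated above (the notation is introduced here for illustration only: $A(S)$ denotes the model's accuracy when it has access to the modality subset $S$), the conditional utilization rate of modality $m_1$ given modality $m_2$ can be written as
\[
u(m_1 \mid m_2) \;=\; A(\{m_1, m_2\}) - A(\{m_2\}),
\]
i.e., the accuracy gained by adding $m_1$ on top of $m_2$; the imbalance we observe corresponds to $u(m_1 \mid m_2)$ differing markedly from $u(m_2 \mid m_1)$.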