Audio-visual learning helps to comprehensively understand the world by fusing practical information from multiple modalities. However, recent studies show that the imbalanced optimization of uni-modal encoders in a jointly trained model is a bottleneck to improving the model's performance. We further find that current imbalance-mitigating methods fail on some audio-visual fine-grained tasks, which place a higher demand on distinguishable feature distributions. Fueled by the success of cosine loss, which builds hyperspherical feature spaces and achieves lower intra-class angular variability, this paper proposes the Multi-Modal Cosine loss (MMCosine). It performs modality-wise $L_2$ normalization on features and weights, aiming for balanced and better multi-modal fine-grained learning. We demonstrate that our method can alleviate the imbalanced optimization from the perspective of weight norms and fully exploit the discriminability of the cosine metric. Extensive experiments demonstrate the effectiveness of our method and its versatility when combined with advanced multi-modal fusion strategies and recent imbalance-mitigating methods.
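To make the core idea concrete, below is a minimal NumPy sketch of what modality-wise $L_2$ normalization of features and classifier weights might look like for a two-stream audio-visual classifier. The function name `mmcosine_logits`, the scale factor `scale`, and the summation of per-modality cosine logits are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Divide by the L2 norm along `axis`; eps guards against zero vectors.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mmcosine_logits(feat_audio, feat_visual, w_audio, w_visual, scale=10.0):
    """Per-modality cosine logits, summed across modalities.

    feat_*: (batch, dim) features from each uni-modal encoder.
    w_*:    (num_classes, dim) classifier weights for each modality.
    `scale` is a hypothetical temperature; the paper's choice may differ.
    """
    # Normalizing both features and weights per modality bounds each
    # modality's logit contribution to [-1, 1], so neither encoder's
    # growing weight norm can dominate the joint prediction.
    cos_a = l2_normalize(feat_audio) @ l2_normalize(w_audio).T
    cos_v = l2_normalize(feat_visual) @ l2_normalize(w_visual).T
    return scale * (cos_a + cos_v)

rng = np.random.default_rng(0)
fa, fv = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
wa, wv = rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
logits = mmcosine_logits(fa, fv, wa, wv)
assert logits.shape == (4, 5)                 # (batch, num_classes)
assert np.all(np.abs(logits) <= 2 * 10.0)     # each cosine term is in [-1, 1]
```

Because the cosine terms are norm-free, the magnitude of each modality's contribution is decoupled from its encoder's weight norm, which is the balancing effect the abstract refers to.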