Multimodal learning helps models comprehensively understand the world by integrating different senses. Accordingly, multiple input modalities are expected to boost model performance, but we find that they are not fully exploited even when the multimodal model outperforms its uni-modal counterpart. Specifically, in this paper we point out that existing multimodal discriminative models, in which a uniform objective is designed for all modalities, can leave uni-modal representations under-optimized because one modality dominates in some scenarios, e.g., sound in a blowing-wind event, vision in a drawing event, etc. To alleviate this optimization imbalance, we propose on-the-fly gradient modulation, which adaptively controls the optimization of each modality by monitoring the discrepancy between their contributions to the learning objective. Further, dynamically changing Gaussian noise is introduced to avoid the possible generalization drop caused by gradient modulation. As a result, we achieve considerable improvement over common fusion methods on different multimodal tasks, and this simple strategy also boosts existing multimodal methods, illustrating its efficacy and versatility. The source code is available at \url{https://github.com/GeWu-Lab/OGM-GE_CVPR2022}.
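The mechanism described above can be sketched as follows. This is a minimal, simplified reading of the idea, not the authors' implementation: it assumes two modalities, scalar "contribution scores" for each (e.g., summed softmax confidence on the correct class), a hypothetical ratio-based discrepancy measure, and a `tanh`-shaped modulation coefficient that down-weights only the dominant modality's gradient, followed by Gaussian noise scaled to the modulated gradient's spread.

```python
import numpy as np

def ogm_ge_modulate(grad_a, grad_b, score_a, score_b, alpha=0.5, rng=None):
    """Sketch of on-the-fly gradient modulation with Gaussian-noise
    generalization enhancement. All concrete formulas here (the ratio
    rho, the 1 - tanh(alpha * rho) coefficient, the noise scale) are
    illustrative assumptions, not the paper's exact definitions.

    grad_a, grad_b : per-modality gradients (numpy arrays)
    score_a, score_b : scalar contribution scores of each modality
    """
    rng = np.random.default_rng() if rng is None else rng
    # Discrepancy ratio: how strongly each modality dominates the other.
    rho_a = score_a / (score_b + 1e-8)
    rho_b = score_b / (score_a + 1e-8)
    # Down-weight only the currently dominant modality (rho > 1);
    # the weaker modality keeps its full gradient.
    k_a = 1.0 - np.tanh(alpha * rho_a) if rho_a > 1.0 else 1.0
    k_b = 1.0 - np.tanh(alpha * rho_b) if rho_b > 1.0 else 1.0
    modulated = []
    for g, k in ((grad_a, k_a), (grad_b, k_b)):
        g_mod = k * g
        # Dynamic Gaussian noise, scaled to the modulated gradient's
        # standard deviation, to offset the generalization drop that
        # plain gradient scaling can cause.
        noise = rng.normal(0.0, g_mod.std() + 1e-8, size=g_mod.shape)
        modulated.append(g_mod + noise)
    return modulated[0], modulated[1], (k_a, k_b)
```

In a training loop, this function would be applied to each modality's encoder gradients after the backward pass and before the optimizer step; only the stronger modality is slowed down, so the weaker one can catch up.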