Deep multimodal learning has achieved great progress in recent years. However, current fusion approaches are static in nature, i.e., they process and fuse multimodal inputs with identical computation, without accounting for the diverse computational demands of different multimodal data. In this work, we propose dynamic multimodal fusion (DynMM), a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference. To this end, we propose a gating function that provides modality-level or fusion-level decisions on the fly based on multimodal features, together with a resource-aware loss function that encourages computational efficiency. Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach. For instance, compared with static fusion approaches, DynMM reduces computation costs by 46.5% with only a negligible accuracy loss (CMU-MOSEI sentiment analysis) and improves segmentation performance with over 21% savings in computation (NYU Depth V2 semantic segmentation). We believe our approach opens a new direction towards dynamic multimodal network design, with applications to a wide range of multimodal tasks.
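To make the gating idea more concrete, below is a minimal illustrative sketch of a modality-level gate paired with a resource-aware loss term. This is not the authors' implementation: the module names (`ModalityGate`, `resource_aware_loss`), the Gumbel-softmax relaxation, and the branch costs are assumptions chosen for illustration only.

```python
# Illustrative sketch (not the DynMM code): a gating network that makes a
# data-dependent choice among fusion branches, trained with an extra term
# penalizing the expected computation cost of the chosen branches.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityGate(nn.Module):
    """Predicts which fusion branch to execute from cheap multimodal features."""

    def __init__(self, in_dim: int, num_branches: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_branches)

    def forward(self, feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.fc(feats)
        # Gumbel-softmax (assumed relaxation): one-hot branch selection in the
        # forward pass, soft gradients in the backward pass.
        return F.gumbel_softmax(logits, tau=tau, hard=True)


def resource_aware_loss(task_loss: torch.Tensor,
                        gate_probs: torch.Tensor,
                        branch_costs: torch.Tensor,
                        lam: float = 0.1) -> torch.Tensor:
    """Task loss plus the expected compute cost of the selected branches."""
    expected_cost = (gate_probs * branch_costs).sum(dim=-1).mean()
    return task_loss + lam * expected_cost


# Toy usage: two fusion branches with hypothetical relative compute costs.
gate = ModalityGate(in_dim=16, num_branches=2)
feats = torch.randn(4, 16)            # pooled multimodal features (batch of 4)
selection = gate(feats)               # (4, 2) one-hot branch choices
costs = torch.tensor([1.0, 3.5])      # relative cost of each branch
loss = resource_aware_loss(torch.tensor(0.7), selection, costs)
```

In practice, the task loss would be computed from the output of whichever branch the gate selects, so that the penalty weight `lam` trades off accuracy against computation per input.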