Standard multi-modal models assume the use of the same modalities in the training and inference stages. In practice, however, the environment in which multi-modal models operate may not satisfy such an assumption, and their performance degrades drastically if any modality is missing at inference time. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are not available at inference time. First, we study how to effectively regularize the model during training (e.g., data augmentation). Second, we investigate fusion methods for robustness to missing modalities: we find that transformer-based fusion is more robust to missing modalities than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing-modality predictive coding by randomly dropping modality features and reconstructing them from the remaining modality features. Coupling these good practices, we build a model that is not only effective in multi-modal action recognition but also robust to missing modalities. Our model achieves state-of-the-art results on multiple benchmarks and maintains competitive performance even in missing-modality scenarios. Code is available at https://github.com/sangminwoo/ActionMAE.
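To make the random modality-drop idea concrete, here is a minimal sketch of the training-time masking and reconstruction setup. It is not the paper's architecture: the modality names, feature shapes, and the mean-based "decoder" are hypothetical placeholders (ActionMAE uses learned encoder/decoder modules), shown only to illustrate dropping one modality's features and reconstructing them from the rest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality features (batch=2, dim=4) for three modalities.
feats = {m: rng.standard_normal((2, 4)) for m in ["rgb", "depth", "ir"]}

def drop_random_modality(feats, rng):
    """Randomly drop one modality; return the remaining features,
    the name of the dropped modality, and its features as the target."""
    dropped = rng.choice(list(feats))
    remaining = {m: f for m, f in feats.items() if m != dropped}
    return remaining, dropped, feats[dropped]

remaining, dropped, target = drop_random_modality(feats, rng)

# Placeholder reconstruction: predict the dropped features as the mean of
# the remaining ones; in ActionMAE a learned decoder plays this role.
pred = np.mean(list(remaining.values()), axis=0)
recon_loss = np.mean((pred - target) ** 2)  # MSE reconstruction objective
```

At inference, the same model can then be fed only the available modalities, since training has already exposed it to every drop pattern.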