Standard multi-modal models assume that the same modalities are available in both the training and inference stages. In practice, however, the environment in which a multi-modal model operates may not satisfy this assumption, and performance degrades drastically if any modality is missing at inference time. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are unavailable at inference time. First, we study how to effectively regularize the model during training (e.g., with data augmentation). Second, we investigate fusion methods for robustness to missing modalities: we find that transformer-based fusion is more robust to missing modalities than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing-modality predictive coding by randomly dropping modality features and attempting to reconstruct them from the remaining modality features. Combining these good practices, we build a model that is not only effective for multi-modal action recognition but also robust to missing modalities. Our model achieves state-of-the-art results on multiple benchmarks and maintains competitive performance even in missing-modality scenarios. Code is available at https://github.com/sangminwoo/ActionMAE.
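The core reconstruction idea behind ActionMAE — drop one modality's feature and predict it from the remaining ones — can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions, not the authors' implementation: `drop_and_reconstruct` is a hypothetical helper, and the mean-pooled context plus a single linear map stand in for the paper's learned memory tokens and transformer-based decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_and_reconstruct(features, w, drop_idx):
    """ActionMAE-style step (simplified sketch): hide one modality's
    feature vector and predict it from the remaining modalities.

    features: (M, D) array, one D-dim feature per modality
    w:        (D, D) matrix standing in for a learned decoder
    drop_idx: index of the modality to drop
    """
    remaining = np.delete(features, drop_idx, axis=0)  # (M-1, D) kept modalities
    context = remaining.mean(axis=0)                   # (D,) pooled context
    prediction = context @ w                           # (D,) reconstructed feature
    target = features[drop_idx]
    loss = float(np.mean((prediction - target) ** 2))  # reconstruction objective
    return prediction, loss

# Toy example: 3 modalities (e.g., RGB, depth, IR) with 8-dim features.
feats = rng.standard_normal((3, 8))
w = rng.standard_normal((8, 8)) * 0.1  # untrained; in practice learned end-to-end
pred, loss = drop_and_reconstruct(feats, w, drop_idx=1)
```

At inference, the same predictor fills in the feature of any absent modality, so the downstream classifier always receives a complete set of modality features.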