In multi-modal action recognition, it is important to consider not only the complementary nature of different modalities but also the global content of an action. In this paper, we propose a novel network, named the Modality Mixer (M-Mixer) network, which leverages complementary information across modalities and the temporal context of an action for multi-modal action recognition. We also introduce a simple yet effective recurrent unit, called the Multi-modal Contextualization Unit (MCU), which is the core component of M-Mixer. Our MCU temporally encodes a sequence of one modality (e.g., RGB) with action content features of the other modalities (e.g., depth and IR). This process encourages M-Mixer to exploit global action content and to supplement it with complementary information from the other modalities. As a result, our proposed method outperforms state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets. Moreover, we demonstrate the effectiveness of M-Mixer through comprehensive ablation studies.
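To make the described architecture concrete, the following is a minimal, hypothetical sketch of an MCU-style recurrent unit. The abstract states only that the MCU temporally encodes one modality's sequence conditioned on action content features of the other modalities; the GRU-based recurrence and concatenation fusion used here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Multi-modal Contextualization Unit (MCU).
# Assumptions (not specified in the abstract): a GRU-style recurrent core
# and fusion of the per-frame primary-modality feature with a global
# action-content feature from the other modalities via concatenation.
import torch
import torch.nn as nn

class MCU(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Input at each step: current frame feature of the primary modality
        # concatenated with the pooled action-content feature of the
        # complementary modalities (assumed fusion scheme).
        self.cell = nn.GRUCell(2 * feat_dim, hidden_dim)

    def forward(self, primary_seq: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # primary_seq: (T, B, feat_dim) frame features of one modality (e.g., RGB)
        # context:     (B, feat_dim) action-content feature of the other
        #              modalities (e.g., depth/IR features pooled over time)
        h = primary_seq.new_zeros(primary_seq.size(1), self.cell.hidden_size)
        for x_t in primary_seq:  # step through time
            h = self.cell(torch.cat([x_t, context], dim=-1), h)
        return h  # contextualized video-level feature

# Usage: 16 frames, batch of 2, 512-d features per modality.
rgb = torch.randn(16, 2, 512)
depth_context = torch.randn(2, 512)
video_feat = MCU(512, 256)(rgb, depth_context)  # -> (2, 256)
```

In this reading, the context vector injected at every time step is what lets the recurrence encode one modality's sequence while staying aware of the global action content carried by the others.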