In 3D action recognition, rich complementary information exists between skeleton modalities. Nevertheless, how to model and exploit this information remains a challenging problem for self-supervised 3D action representation learning. In this work, we formulate cross-modal interaction as a bidirectional knowledge distillation problem. Unlike classic distillation, where a fixed, pre-trained teacher transfers its knowledge to a student, here the knowledge is continuously updated and bidirectionally distilled between modalities. To this end, we propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. On the one hand, the neighboring similarity distribution is introduced to model the knowledge learned in each modality, where the relational information is naturally suited to contrastive frameworks. On the other hand, asymmetric configurations are used for the teacher and student to stabilize the distillation process and to transfer high-confidence information between modalities. By derivation, we find that the cross-modal positive mining in previous works can be regarded as a degenerate version of our CMD. We perform extensive experiments on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets. Our approach outperforms existing self-supervised methods and sets a series of new records. The code is available at: https://github.com/maoyunyao/CMD
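The core idea above can be illustrated with a minimal sketch. The details below are assumptions for illustration only (the memory-bank size, temperatures, and function names are hypothetical, not the paper's exact formulation): each modality describes a sample by a softmax distribution of similarities to a bank of anchor embeddings, and the sharper (lower-temperature) teacher distribution of one modality supervises the softer student distribution of the other, in both directions.

```python
import numpy as np

def neighbor_distribution(z, bank, tau):
    # Softmax over cosine similarities between an embedding z and a bank of
    # anchor embeddings: the "neighboring similarity distribution" that
    # represents the relational knowledge learned in one modality.
    z = z / np.linalg.norm(z)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    logits = bank @ z / tau
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def kl(p, q):
    # KL divergence between two discrete distributions (teacher || student).
    return float(np.sum(p * np.log(p / q)))

def cmd_loss(z_a, z_b, bank_a, bank_b, tau_teacher=0.05, tau_student=0.1):
    # Bidirectional distillation with asymmetric temperatures: the teacher
    # side uses a lower temperature, so only high-confidence neighbors carry
    # significant mass. In training, the teacher distribution would be
    # detached from the gradient; omitted here since this is plain NumPy.
    a_teaches_b = kl(neighbor_distribution(z_a, bank_a, tau_teacher),
                     neighbor_distribution(z_b, bank_b, tau_student))
    b_teaches_a = kl(neighbor_distribution(z_b, bank_b, tau_teacher),
                     neighbor_distribution(z_a, bank_a, tau_student))
    return a_teaches_b + b_teaches_a

# Toy example: two modalities (e.g. joint and motion), one query embedding
# each, plus a bank of 256 anchor embeddings per modality.
rng = np.random.default_rng(0)
z_a, z_b = rng.normal(size=128), rng.normal(size=128)
bank_a, bank_b = rng.normal(size=(256, 128)), rng.normal(size=(256, 128))
loss = cmd_loss(z_a, z_b, bank_a, bank_b)
```

Because each direction is a KL divergence between valid distributions, the loss is non-negative; it vanishes only when the two modalities induce identical neighborhood structure over their banks.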