Motion recognition is a promising direction in computer vision, but training video classification models is much harder than training image models due to insufficient data and considerably more parameters. To get around this, some works strive to explore multimodal cues from RGB-D data. Although they improve motion recognition to some extent, these methods still face sub-optimal situations in the following aspects: (i) data augmentation, i.e., the scale of RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) optimization mechanism, i.e., the tightly space-time-entangled network structure brings additional challenges to spatiotemporal information modeling; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion. To alleviate these drawbacks, we propose in this paper to improve RGB-D-based motion recognition from both the data and the algorithm perspectives. In more detail, we first introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition. Second, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning. Finally, a novel cross-modal Complement Feature Catcher (CFCer) is explored to mine potential commonality features in multimodal information as an auxiliary fusion stream, improving the late fusion results. The seamless combination of these novel designs yields a robust spatiotemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets. Specifically, UMDR achieves an unprecedented improvement of +4.5% on the Chalearn IsoGD dataset. Our code is available at https://github.com/zhoubenjia/MotionRGBD-PAMI.
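To make the ShuffleMix idea concrete, the following is a minimal PyTorch sketch, assuming (as the abstract suggests but does not fully specify) that ShuffleMix complements MixUp by linearly interpolating a video with a clip-wise temporally shuffled second video; the function name, clip count, and mixing rule are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def shufflemix(video_a, video_b, alpha=0.5, num_clips=4):
    """Hypothetical ShuffleMix-style augmentation for videos of shape (C, T, H, W).

    Assumed behavior: video_b is split along time into clips, the clip order
    is shuffled, and the result is MixUp-interpolated with video_a. Labels
    would be mixed with the same coefficient lam, as in standard MixUp.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Split video_b along the temporal axis and permute the clip order.
    clips = video_b.chunk(num_clips, dim=1)
    order = torch.randperm(len(clips))
    shuffled_b = torch.cat([clips[i] for i in order], dim=1)
    # MixUp-style interpolation; the shuffled stream injects temporal noise,
    # which is the source of the extra temporal regularization.
    return lam * video_a + (1.0 - lam) * shuffled_b, lam
```

In use, one would draw `video_a` and `video_b` from the same training batch and compute the loss as `lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b)`, exactly as in MixUp; the only change under this assumption is the temporal shuffling of the second sample.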