We introduce a novel representation learning method to disentangle pose-dependent and view-dependent factors from 2D human poses. The method trains a network via cross-view mutual information maximization (CV-MIM), which maximizes the mutual information between representations of the same pose captured from different viewpoints in a contrastive learning manner. We further propose two regularization terms to ensure the disentanglement and smoothness of the learned representations. The resulting pose representations can be used for cross-view action recognition. To evaluate the power of the learned representations, in addition to the conventional fully-supervised action recognition settings, we introduce a novel task called single-shot cross-view action recognition. In this task, models are trained on actions from only a single viewpoint and evaluated on poses captured from all possible viewpoints. We evaluate the learned representations on standard action recognition benchmarks and show that (i) CV-MIM performs competitively with state-of-the-art models in the fully-supervised scenarios; (ii) CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting; and (iii) the learned representations significantly boost performance when the amount of supervised training data is reduced.
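To make the cross-view contrastive objective concrete, the sketch below shows one common way to implement mutual information maximization between paired views, using a symmetric InfoNCE loss. This is an illustrative sketch only, not the paper's actual implementation: the encoder, the `temperature` value, and the exact form of the loss (the paper additionally uses two regularization terms not shown here) are assumptions.

```python
# Minimal sketch of a cross-view InfoNCE contrastive objective.
# Assumes some encoder has already mapped 2D poses from two viewpoints
# to embeddings z_view_a and z_view_b of shape (batch, dim), where row i
# of each tensor corresponds to the same underlying pose.
import torch
import torch.nn.functional as F

def cross_view_infonce(z_view_a, z_view_b, temperature=0.1):
    """Pulls together embeddings of the same pose seen from two
    viewpoints and pushes apart embeddings of different poses,
    which lower-bounds the mutual information between the views."""
    z_a = F.normalize(z_view_a, dim=1)
    z_b = F.normalize(z_view_b, dim=1)
    # Pairwise cosine similarities between all poses in the batch.
    logits = z_a @ z_b.t() / temperature  # (batch, batch)
    # The matching pair for row i is column i.
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric InfoNCE: each view predicts its paired counterpart.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In this formulation, each pose embedding from one viewpoint treats its counterpart from the other viewpoint as the positive and all other poses in the batch as negatives; the hypothetical `temperature` parameter controls how sharply the loss concentrates on hard negatives.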