Learning view-invariant representations is key to improving the discriminative power of features for skeleton-based action recognition. Existing approaches cannot effectively remove the impact of viewpoint because their learned representations remain implicitly view-dependent. In this work, we propose a self-supervised framework, Focalized Contrastive View-invariant Learning (FoCoViL), which significantly suppresses view-specific information in a representation space where viewpoints are coarsely aligned. By maximizing mutual information between multi-view sample pairs with an effective contrastive loss, FoCoViL associates actions that share common view-invariant properties and simultaneously separates dissimilar ones. We further propose an adaptive focalization method based on pairwise similarity that strengthens contrastive learning and yields clearer cluster boundaries in the learned space. Unlike many existing self-supervised representation learning works that rely heavily on supervised classifiers, FoCoViL performs well with both unsupervised and supervised classifiers, achieving superior recognition performance. Extensive experiments further show that the proposed contrastive focalization produces a more discriminative latent representation.
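To make the idea concrete, the following is a minimal NumPy sketch of a similarity-focalized contrastive objective in the spirit described above: an InfoNCE-style loss over cross-view positive pairs, with a focal weight that down-weights pairs that are already well aligned. The function name, the temperature `tau`, and the focusing exponent `gamma` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def focal_contrastive_loss(z1, z2, tau=0.1, gamma=2.0):
    """Hypothetical sketch of a focalized cross-view contrastive loss.

    z1, z2: (N, D) L2-normalized embeddings of the same N action samples
    captured from two different viewpoints. Row i of z1 and row i of z2
    form a positive pair; all other cross-view pairs act as negatives.
    """
    sim = z1 @ z2.T / tau                          # (N, N) scaled cosine similarities
    logits = sim - sim.max(axis=1, keepdims=True)  # shift for numerical stability
    exp = np.exp(logits)
    p_pos = np.diag(exp) / exp.sum(axis=1)         # softmax probability of the positive
    # Focal weight: pairs the model already aligns well (high p_pos) are
    # down-weighted, concentrating the loss on hard multi-view pairs.
    weight = (1.0 - p_pos) ** gamma
    return float(np.mean(-weight * np.log(p_pos + 1e-12)))
```

In this sketch, correctly matched multi-view embeddings yield a near-zero loss, while mismatched pairs are penalized heavily; the focal weight sharpens this contrast, which corresponds to the clearer cluster boundaries the abstract claims.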