In this paper we propose an unsupervised feature extraction method to capture temporal information from monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally-distant frames as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying the contrastive loss only to the time-variant features, and encouraging a gradual transition on them between nearby and distant frames while also reconstructing the input, extracts rich temporal features well-suited for human pose estimation. Our approach reduces error by about 50% compared to standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D pose information is available, our approach can extract even richer latent features and improve 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
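To make the idea of applying the contrastive objective only to the time-variant part of the latent vector concrete, here is a minimal sketch in PyTorch. All names (the function, the `split` point between time-variant and time-invariant dimensions, the Gaussian weighting by temporal distance) are hypothetical illustrations of the described strategy, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def disentangled_css_loss(z, frame_idx, split=128, temperature=0.1, sigma=4.0):
    """Hypothetical sketch: contrastive loss on the time-variant half of each
    latent vector, with soft targets that decay with temporal distance.

    z         : (B, D) latent vectors for B frames sampled from one video
    frame_idx : (B,) frame indices of those samples
    split     : first `split` dims are treated as time-variant, the rest as time-invariant
    """
    z_var = F.normalize(z[:, :split], dim=1)           # time-variant component only
    sim = z_var @ z_var.t() / temperature               # (B, B) pairwise similarities

    # Soft targets: nearby frames should stay similar, distant frames dissimilar,
    # with a gradual (here Gaussian) transition instead of a hard positive/negative cut.
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs().float()
    target = torch.softmax(-(dist ** 2) / (2 * sigma ** 2), dim=1)

    log_prob = F.log_softmax(sim, dim=1)
    return -(target * log_prob).sum(dim=1).mean()

# The time-invariant component z[:, split:] is left untouched by this loss; in the
# described setup it would be shaped only by the reconstruction objective
# (encoder/decoder not shown here).
```

In training, this term would be combined with a reconstruction loss on the decoded input, so that the time-invariant dimensions capture appearance while the time-variant ones capture motion.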