In this paper, we propose a novel approach to enhance the 3D body pose estimation of a person from videos captured by a single wearable camera. The key idea is to leverage high-level features linking first- and third-person views in a joint embedding space. To learn such an embedding space, we introduce First2Third-Pose, a new paired, synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-person perspectives. We explicitly consider spatial- and motion-domain features, combined using a semi-Siamese architecture trained in a self-supervised fashion. Experimental results demonstrate that the joint multi-view embedding space learned with our dataset is useful for extracting discriminative features from arbitrary single-view egocentric videos, without requiring domain adaptation or knowledge of camera parameters. We achieve significant improvements in egocentric 3D body pose estimation on two unconstrained datasets over three supervised state-of-the-art approaches. Our dataset and code will be made available for research purposes.
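To make the described architecture concrete, the sketch below illustrates one possible form of a semi-Siamese joint embedding of spatial- and motion-domain features from paired first- and third-person clips, trained with a self-supervised contrastive objective on synchronized pairs. The layer names, feature dimensions, and the InfoNCE-style loss are illustrative assumptions, not the released implementation.

```python
# Minimal sketch (not the authors' code) of a semi-Siamese first/third-person
# embedding. Dimensions, module names, and the loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemiSiameseEmbedding(nn.Module):
    def __init__(self, spatial_dim=2048, motion_dim=1024, embed_dim=256):
        super().__init__()
        # Unshared branches (the "semi" part): each view fuses its own
        # spatial- and motion-domain features.
        self.ego_fuse = nn.Linear(spatial_dim + motion_dim, 512)
        self.exo_fuse = nn.Linear(spatial_dim + motion_dim, 512)
        # Shared projection head maps both views into the joint space.
        self.shared_head = nn.Sequential(nn.ReLU(), nn.Linear(512, embed_dim))

    def _embed(self, fuse, spatial, motion):
        z = self.shared_head(fuse(torch.cat([spatial, motion], dim=-1)))
        return F.normalize(z, dim=-1)

    def forward(self, ego_spatial, ego_motion, exo_spatial, exo_motion):
        z_ego = self._embed(self.ego_fuse, ego_spatial, ego_motion)
        z_exo = self._embed(self.exo_fuse, exo_spatial, exo_motion)
        return z_ego, z_exo

def contrastive_loss(z_ego, z_exo, temperature=0.07):
    # Synchronized first/third-person clips are positives; all other pairs
    # in the batch act as negatives (self-supervised, no pose labels needed).
    logits = z_ego @ z_exo.t() / temperature
    targets = torch.arange(z_ego.size(0), device=z_ego.device)
    return F.cross_entropy(logits, targets)
```

In such a design, the shared head encourages both views to land in a common space, while the unshared branches absorb the large appearance gap between egocentric and third-person footage.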