View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose

From arXiv. Accepted to the International Journal of Computer Vision (IJCV). Code is available at https://github.com/google-research/google-research/tree/master/poem. Video synchronization results are available at https://drive.google.com/corp/drive/folders/1nhPuEcX4Lhe6iK3nv84cvSCov2eJ52Xy. arXiv admin note: text overlap with arXiv:1912.01001

Recognition of human poses and actions is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, whose appearance can vary significantly across viewpoints, making recognition tasks challenging. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well studied in existing work. We propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we adopt a probabilistic formulation for our embedding space. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We also show that by training a simple temporal embedding model, we achieve superior performance on pose sequence retrieval and substantially reduce the embedding dimension compared with stacking frame-based embeddings, enabling efficient large-scale retrieval. To enable our embeddings to work with partially visible input, we further investigate different keypoint occlusion augmentation strategies during training. We demonstrate that these occlusion augmentations significantly improve retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that using our embeddings without any additional training achieves competitive performance relative to other models specifically trained for each task.
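
The abstract names two ideas without giving details: keypoint occlusion augmentation during training, and a probabilistic (rather than deterministic) embedding. The sketch below illustrates both under stated assumptions and is not the authors' implementation. It assumes that occlusion is simulated by independently dropping each 2D keypoint with a fixed probability, that each embedding is a diagonal Gaussian, and that a match score is estimated by sampling from both Gaussians and averaging a sigmoid of the negative distance. The function names `occlude_keypoints` and `match_probability`, and the parameters `drop_prob`, `num_samples`, `scale`, and `offset`, are hypothetical.

```python
import math
import random


def occlude_keypoints(keypoints, drop_prob, rng):
    """Randomly mark 2D keypoints as occluded (assumed strategy).

    Returns (masked_keypoints, visibility): occluded joints are zeroed
    out, and visibility[i] is 1.0 for kept joints and 0.0 otherwise.
    """
    masked, visibility = [], []
    for x, y in keypoints:
        visible = rng.random() >= drop_prob
        masked.append((x, y) if visible else (0.0, 0.0))
        visibility.append(1.0 if visible else 0.0)
    return masked, visibility


def match_probability(mean_a, std_a, mean_b, std_b, rng,
                      num_samples=20, scale=1.0, offset=0.0):
    """Monte Carlo match score between two diagonal Gaussian embeddings.

    Draws one sample from each embedding per iteration and averages
    sigmoid(-scale * distance + offset) over all iterations.
    """
    total = 0.0
    for _ in range(num_samples):
        za = [rng.gauss(m, s) for m, s in zip(mean_a, std_a)]
        zb = [rng.gauss(m, s) for m, s in zip(mean_b, std_b)]
        dist = math.dist(za, zb)
        # sigmoid(-scale * dist + offset), written out explicitly
        total += 1.0 / (1.0 + math.exp(scale * dist - offset))
    return total / num_samples


# Usage: occlude a toy 3-joint pose, then score two toy embeddings.
rng = random.Random(0)
pose = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
masked_pose, visibility = occlude_keypoints(pose, drop_prob=0.3, rng=rng)
score = match_probability([0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [0.1, 0.1], rng)
```

In a training setup of this kind, the visibility mask would typically be fed to the model alongside the masked keypoints so that zeroed joints are not mistaken for real coordinates. Whether the paper does this is not stated in the abstract.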
