Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are hard and expensive to obtain. Although transformers have recently been utilized for body pose sequence modeling, related methods rely on pseudo ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic, and versatile, as it can be plugged on top of any image-based model to transform it into a video-based model that leverages temporal information. We showcase variants of PoseBERT with different inputs, ranging from 3D skeleton keypoints to rotations of a 3D parametric model for either the full body (SMPL) or just the hands (MANO). Since PoseBERT training is task-agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction, or motion completion without finetuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.
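To make the masked-modeling idea concrete, below is a minimal sketch of BERT-style masked training over pose sequences. All names, dimensions, the 15% masking ratio, and the plain MSE reconstruction loss are illustrative assumptions, not the paper's exact architecture or objective; the 72-dimensional pose vector stands in for SMPL's 24 joint rotations in axis-angle form.

```python
# Hypothetical sketch: masked modeling of MoCap pose sequences with a
# transformer encoder. Masked frames are replaced by a learned [MASK]
# embedding and the model is trained to reconstruct them.
import torch
import torch.nn as nn

class MaskedPoseModel(nn.Module):
    def __init__(self, pose_dim=72, d_model=256, nhead=4, num_layers=4, seq_len=16):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)                 # per-frame pose embedding
        self.mask_token = nn.Parameter(torch.randn(d_model))      # learned [MASK] embedding
        self.pos = nn.Parameter(torch.randn(seq_len, d_model))    # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, pose_dim)                  # regress a pose at every frame

    def forward(self, poses, mask):
        # poses: (B, T, pose_dim) pose parameters; mask: (B, T) bool, True = hidden frame
        x = self.embed(poses)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.pos
        return self.head(self.encoder(x))

# One illustrative training step on random stand-in "MoCap" data.
model = MaskedPoseModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
poses = torch.randn(8, 16, 72)             # batch of pose sequences
mask = torch.rand(8, 16) < 0.15            # hide ~15% of the frames
pred = model(poses, mask)
loss = ((pred - poses) ** 2)[mask].mean()  # reconstruct only the masked frames
loss.backward()
opt.step()
```

Because training only ever sees pose sequences with frames hidden at arbitrary positions, the same trained model can, at test time, fill gaps (motion completion), denoise per-frame predictions (pose refinement), or mask the final frames to predict future poses, which is why no task-specific finetuning is needed.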