Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tuning image-based models on synthetic renderings of MoCap data improves their performance by exposing them to a wider variety of poses, textures and backgrounds. In fact, we show that simply fine-tuning the batch normalization layers of the model is enough to achieve large gains. We further study the use of MoCap data for video, and introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling. It is simple, generic and can be plugged on top of any state-of-the-art image-based model in order to transform it into a video-based model that leverages temporal information. Our experimental results show that the proposed approaches reach state-of-the-art performance on various datasets including 3DPW, MPI-INF-3DHP, MuPoTS-3D, MCB and AIST. Test code and models will be available soon.
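To make the batch-normalization fine-tuning idea concrete, here is a minimal PyTorch sketch that freezes every parameter of a model except those of its BatchNorm layers, so that only those layers are updated during fine-tuning on the synthetic renderings. The backbone used here (a torchvision ResNet-50) is a stand-in assumption, not the paper's exact model.

```python
import torch
from torch import nn
import torchvision

def freeze_all_but_batchnorm(model: nn.Module) -> None:
    """Freeze every parameter except those of batch-normalization layers."""
    for module in model.modules():
        is_bn = isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
        for param in module.parameters(recurse=False):
            param.requires_grad = is_bn

# Stand-in backbone: any mesh-recovery model containing BatchNorm layers
# would be handled the same way.
model = torchvision.models.resnet50(weights=None)
freeze_all_but_batchnorm(model)

# Optimize only the (few) parameters left trainable.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

Because BatchNorm layers hold only a tiny fraction of the network's parameters, this keeps fine-tuning cheap while still letting the model adapt its feature statistics to the synthetic data.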
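As a rough illustration of the masked-modeling idea behind PoseBERT (a sketch, not the authors' exact architecture), the module below takes a sequence of noisy per-frame pose estimates produced by any image-based model, randomly replaces some frames with a learned mask token during training, and regresses the full pose sequence with a transformer encoder. The class name, layer sizes, masking ratio, and the pose dimension (assuming 24 SMPL joints in a 6D rotation representation, i.e. 144 values per frame) are all assumptions made for this example.

```python
import torch
from torch import nn

class PoseBERTSketch(nn.Module):
    """Hypothetical PoseBERT-style module: masked modeling over pose sequences."""

    def __init__(self, pose_dim=144, d_model=256, nhead=4, num_layers=4, max_len=128):
        super().__init__()
        self.embed = nn.Linear(pose_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))       # learned positions
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))      # learned [MASK]
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, pose_dim)

    def forward(self, poses, mask_ratio=0.15):
        # poses: (batch, time, pose_dim) noisy per-frame estimates.
        b, t, _ = poses.shape
        tokens = self.embed(poses) + self.pos[:, :t]
        if self.training:
            # Randomly hide a fraction of the frames during training only.
            keep = torch.rand(b, t, device=poses.device) > mask_ratio
            tokens = torch.where(keep.unsqueeze(-1), tokens,
                                 self.mask_token.expand(b, t, -1))
        # Regress pose parameters for every frame, masked or not.
        return self.head(self.encoder(tokens))

model = PoseBERTSketch()
noisy = torch.randn(2, 16, 144)   # per-frame estimates from an image-based model
smoothed = model(noisy)           # temporally consistent pose sequence
```

Under this setup, training would minimize a reconstruction loss (e.g. MSE) between the regressed poses and ground-truth MoCap sequences, which is what allows the module to be trained from MoCap data alone and then plugged on top of any image-based model at test time.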