Human mesh recovery (HMR) provides rich human body information for real-world applications such as gaming, human-computer interaction, and virtual reality. Compared with single-image methods, video-based methods can exploit temporal information and human body motion priors to further improve performance. However, many-to-many approaches such as VIBE suffer from poor motion smoothness and temporal inconsistency, while many-to-one approaches such as TCMR and MPS-Net rely on future frames, which makes them non-causal and time-inefficient during inference. To address these challenges, we present a novel Diffusion-Driven Transformer-based framework (DDT) for video-based HMR. DDT is designed to decode specific motion patterns from the input sequence, enhancing motion smoothness and temporal consistency. Because DDT is a many-to-many approach, its decoder outputs the human mesh for all frames, making DDT more viable for real-world applications where time efficiency is crucial and a causal model is desired. Extensive experiments on the widely used Human3.6M, MPI-INF-3DHP, and 3DPW datasets demonstrate the effectiveness and efficiency of DDT.