Predicting 3D human poses in real-world scenarios, also known as human pose forecasting, is inevitably subject to noisy inputs arising from inaccurate 3D pose estimates and occlusions. To address these challenges, we propose a diffusion-based approach that predicts future poses from noisy observations. We frame the prediction task as a denoising problem in which the observation and the prediction are treated as a single sequence containing missing elements, whether they fall in the observation or in the prediction horizon. All missing elements are treated as noise and denoised with our conditional diffusion model. To better handle long forecasting horizons, we present a temporal cascaded diffusion model. We demonstrate the benefits of our approach on four publicly available datasets (Human3.6M, HumanEva-I, AMASS, and 3DPW), where it outperforms the state of the art. Additionally, we show that our framework is generic enough to improve any 3D pose prediction model, serving as a pre-processing step that repairs its inputs and as a post-processing step that refines its outputs. The code is available online: \url{https://github.com/vita-epfl/DePOSit}.
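To make the "single sequence with missing elements" framing concrete, the following is a minimal sketch of how such an input could be assembled. It is an illustration under our own assumptions, not the released DePOSit code: the function name `build_noisy_sequence`, the array layout (frames, joints, 3), and the boolean visibility mask are all hypothetical, and the conditional diffusion model that would consume the result is not shown.

```python
import numpy as np


def build_noisy_sequence(observed, observed_mask, horizon, rng):
    """Assemble one sequence covering observation and prediction frames.

    observed:      (T_obs, J, 3) array of 3D joint positions (assumed layout).
    observed_mask: (T_obs, J) boolean array, True where a joint is available.
    horizon:       number of future frames to predict.
    Returns the sequence (T_obs + horizon, J, 3) and its mask (T_obs + horizon, J).
    """
    t_obs, n_joints, _ = observed.shape
    # Future frames are entirely unknown, so their mask is all False.
    future_mask = np.zeros((horizon, n_joints), dtype=bool)
    mask = np.concatenate([observed_mask, future_mask], axis=0)
    # Start from Gaussian noise everywhere, then copy in the known joints.
    sequence = rng.standard_normal((t_obs + horizon, n_joints, 3))
    sequence[:t_obs][observed_mask] = observed[observed_mask]
    return sequence, mask


# Hypothetical usage: 4 observed frames, 5 joints, one occluded joint.
rng = np.random.default_rng(0)
observed = rng.standard_normal((4, 5, 3))
observed_mask = np.ones((4, 5), dtype=bool)
observed_mask[2, 1] = False
sequence, mask = build_noisy_sequence(observed, observed_mask, horizon=6, rng=rng)
```

In this sketch an occluded joint in the observation and an unpredicted future joint are represented the same way: both are masked out and filled with noise, so a denoiser conditioned on the unmasked entries could in principle handle either case.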