Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at https://github.com/Vegetebird/MHFormer.
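The three-stage decomposition described above can be illustrated with a minimal NumPy sketch. This is a conceptual toy, not the paper's actual implementation: the shapes, random projections, and the mean-based merge are all hypothetical stand-ins for MHFormer's learned transformer modules.

```python
import numpy as np

# Conceptual sketch of the three-stage multi-hypothesis flow
# (hypothetical shapes and operations, not the paper's real modules).
rng = np.random.default_rng(0)

num_hyp, seq_len, dim = 3, 27, 64            # hypothetical sizes
pose_2d_feats = rng.standard_normal((seq_len, dim))

# Stage (i): generate multiple initial hypothesis representations,
# here via independent linear projections of shared 2D features.
proj = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(num_hyp)]
hypotheses = [pose_2d_feats @ W for W in proj]

# Stage (ii): self-hypothesis communication -- merge the hypotheses
# into a single converged representation, then partition it back
# into several diverged hypotheses (averaging stands in for learning).
converged = np.mean(hypotheses, axis=0)
diverged = [converged @ W for W in proj]

# Stage (iii): cross-hypothesis communication and aggregation to
# synthesize the final 3D pose (17 joints x 3 coordinates).
aggregated = np.concatenate(diverged, axis=-1)   # (seq_len, num_hyp * dim)
head = rng.standard_normal((num_hyp * dim, 17 * 3)) / np.sqrt(num_hyp * dim)
pose_3d = (aggregated @ head).reshape(seq_len, 17, 3)
print(pose_3d.shape)
```

In the real model each stage is a transformer block with attention across hypotheses and across time; the sketch only mirrors the data flow of generate, merge/partition, and aggregate.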