Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To alleviate this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) generate multiple initial hypothesis representations; (ii) model self-hypothesis communication, merging the multiple hypotheses into a single converged representation and then partitioning it into several diverged hypotheses; (iii) learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through these stages, the final representation is enhanced and the synthesized pose is considerably more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at \url{https://github.com/Vegetebird/MHFormer}.
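The three-stage decomposition above can be illustrated with a minimal sketch. This is not the actual MHFormer architecture (which uses transformer blocks); it is a toy NumPy pipeline, with random projections standing in for learned embeddings and a simple softmax-weighted sum standing in for cross-hypothesis attention, all names hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_hypotheses(pose_2d, num_hyp=3, dim=64):
    """Stage (i): map a 2D pose sequence to several initial
    hypothesis representations (random projections stand in
    for the model's learned embedding layers)."""
    frames, joints, _ = pose_2d.shape
    feats = pose_2d.reshape(frames, joints * 2)
    projections = [rng.standard_normal((joints * 2, dim)) for _ in range(num_hyp)]
    return [feats @ w for w in projections]

def self_hypothesis_communication(hyps):
    """Stage (ii): merge all hypotheses into one converged
    representation, then partition it back into diverged
    hypotheses enriched by the shared feature."""
    converged = np.mean(hyps, axis=0)      # merge
    return [h + converged for h in hyps]   # diverge around the merged feature

def cross_hypothesis_communication(hyps, num_joints=17):
    """Stage (iii): aggregate the hypothesis features and regress
    the final 3D pose (softmax-weighted fusion as a stand-in for
    cross-hypothesis attention)."""
    scores = np.array([h.mean() for h in hyps])
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = sum(w * h for w, h in zip(weights, hyps))
    head = rng.standard_normal((fused.shape[1], num_joints * 3))
    return (fused @ head).reshape(fused.shape[0], num_joints, 3)

# Toy input: 9 frames of 17 2D joints.
pose_2d = rng.standard_normal((9, 17, 2))
hyps = generate_hypotheses(pose_2d)
hyps = self_hypothesis_communication(hyps)
pose_3d = cross_hypothesis_communication(hyps)
print(pose_3d.shape)  # (9, 17, 3)
```

The key structural point the sketch preserves is that the hypotheses are first converged and then re-diverged before fusion, so each hypothesis carries shared context rather than evolving in isolation.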