Existing methods for human mesh recovery mainly focus on single-view frameworks, but they often fail to produce accurate results due to the ill-posed setup. Considering the maturity of multi-view motion capture systems, in this paper, we propose to mitigate this ill-posedness by leveraging multiple images from different views, thus significantly enhancing the quality of recovered meshes. In particular, we present a novel \textbf{M}ulti-view human body \textbf{M}esh \textbf{T}ranslator (MMT) model for estimating human body meshes based on a vision transformer. Specifically, MMT takes multi-view images as input and translates them to targeted meshes in a single forward pass. MMT fuses features from different views in both the encoding and decoding phases, yielding representations embedded with global information. Additionally, to ensure the tokens are intensively focused on the human pose and shape, MMT conducts cross-view alignment at the feature level by projecting 3D keypoint positions to each view and enforcing their consistency via geometric constraints. Comprehensive experiments demonstrate that MMT outperforms existing single- and multi-view models by a large margin on the human mesh recovery task, notably achieving a 28.8\% improvement in MPVE over the current state-of-the-art method on the challenging HUMBI dataset. Qualitative evaluation also verifies the effectiveness of MMT in reconstructing high-quality human meshes. Code will be made available upon acceptance.
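The cross-view alignment described above can be illustrated with a minimal sketch: shared 3D keypoints are projected into each calibrated view with a pinhole camera model, and disagreement with the per-view 2D keypoints is penalized. This is an illustrative simplification, not MMT's actual implementation; the function names and the L2 loss form are assumptions for exposition.

```python
import numpy as np

def project_points(X, K, R, t):
    """Project (N, 3) world-space keypoints into one camera view.

    K: (3, 3) intrinsics; R: (3, 3) rotation and t: (3,) translation
    mapping world to camera coordinates. Returns (N, 2) pixel coords.
    """
    Xc = X @ R.T + t               # world -> camera coordinates
    uvw = Xc @ K.T                 # perspective projection
    return uvw[:, :2] / uvw[:, 2:3]  # divide by depth

def cross_view_alignment_loss(X_pred, cameras, kp2d_per_view):
    """Hypothetical consistency loss: mean L2 error between projected
    3D keypoints and the 2D keypoints observed in each view."""
    errs = []
    for (K, R, t), kp2d in zip(cameras, kp2d_per_view):
        proj = project_points(X_pred, K, R, t)
        errs.append(np.mean(np.linalg.norm(proj - kp2d, axis=1)))
    return float(np.mean(errs))
```

When the predicted 3D keypoints agree with every view's 2D observations, the loss is zero; any geometric inconsistency across views increases it, which is the constraint the abstract refers to.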