Transformer encoder architectures have recently achieved state-of-the-art results on monocular 3D human mesh reconstruction, but they require a substantial number of parameters and expensive computations. Due to the large memory overhead and slow inference speed, it is difficult to deploy such models for practical use. In this paper, we propose a novel transformer encoder-decoder architecture for 3D human mesh reconstruction from a single image, called FastMETRO. We identify that the performance bottleneck in encoder-based transformers is caused by their token design, which introduces high-complexity interactions among input tokens. We disentangle these interactions via an encoder-decoder architecture, which allows our model to require far fewer parameters and a shorter inference time. In addition, we impose prior knowledge of the human body's morphological relationships via attention masking and mesh upsampling operations, which leads to faster convergence with higher accuracy. FastMETRO improves the Pareto front of accuracy and efficiency, and clearly outperforms image-based methods on Human3.6M and 3DPW. Furthermore, we validate its generalizability on FreiHAND.
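To make the attention-masking idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the mask is built, so this assumes the simplest reading: a vertex token may attend only to itself and to vertices it shares a mesh edge with. The function names, the 4-vertex chain, and the uniform scores are all hypothetical and chosen for illustration.

```python
import numpy as np


def adjacency_attention_mask(num_vertices, edges):
    """Return a boolean (V, V) mask where True marks a blocked pair.

    Assumption: attention is allowed only along mesh edges and on the
    diagonal (each vertex attends to itself).
    """
    allowed = np.eye(num_vertices, dtype=bool)
    for i, j in edges:
        allowed[i, j] = True
        allowed[j, i] = True  # mesh edges are undirected
    return ~allowed


def masked_softmax(scores, blocked):
    """Row-wise softmax that assigns zero weight to blocked pairs."""
    scores = np.where(blocked, -np.inf, scores)
    # Subtract the row maximum for numerical stability; every row keeps
    # at least its diagonal entry, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Hypothetical toy mesh: four vertices connected in a chain 0-1-2-3.
edges = [(0, 1), (1, 2), (2, 3)]
blocked = adjacency_attention_mask(4, edges)

# Uniform raw scores, so the resulting weights reflect the mask alone.
weights = masked_softmax(np.zeros((4, 4)), blocked)
```

In this toy case, vertex 0 splits its attention evenly between itself and vertex 1, and places no weight on vertices 2 and 3. A real mesh would derive `edges` from the template's face list, and the mask would be applied inside each attention layer rather than to a standalone score matrix.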