Recently, vision transformers have shown great success in a set of human reconstruction tasks such as 2D human pose estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh reconstruction (HMR). In these tasks, feature map representations of the human structural information are often first extracted from the image by a CNN (such as HRNet) and then further processed by a transformer to predict the heatmaps (which encode each joint's location into a feature map with a Gaussian distribution) for HPE or HMR. However, existing transformer architectures cannot process these feature map inputs directly, forcing an unnatural flattening of the location-sensitive human structural information. Furthermore, much of the performance benefit in recent HPE and HMR methods has come at the cost of ever-increasing computation and memory needs. To address both problems simultaneously, we propose FeatER, a novel transformer design that preserves the inherent structure of feature map representations when modeling attention while reducing memory and computational costs. Building on FeatER, we construct an efficient network for a set of human reconstruction tasks including 2D HPE, 3D HPE, and HMR. A feature map reconstruction module is applied to improve the performance of human pose and mesh estimation. Extensive experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, FeatER outperforms the SOTA method MeshGraphormer on the Human3.6M and 3DPW datasets while requiring only 5% of its Params and 16% of its MACs. The project webpage is https://zczcwh.github.io/feater_page/.
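As a rough illustration of the heatmap encoding and the flattening issue described above, the following minimal sketch builds a Gaussian heatmap for a single joint, decodes the joint location from its peak, and shows where that peak ends up once the 2D map is flattened into a 1D token sequence. It is not FeatER's implementation: the grid size, the joint location, the `sigma` value, and the helper names `gaussian_heatmap`, `decode_peak`, and `flatten` are all assumptions chosen for illustration.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Return a height x width grid whose values follow an unnormalized
    2D Gaussian centered on the joint location (cx, cy).
    The default sigma is an illustrative assumption."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def decode_peak(heatmap):
    """Recover the joint location as the (x, y) position of the maximum value."""
    _, x, y = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return x, y


def flatten(heatmap):
    """Row-major flattening, as done when a 2D map is turned into a 1D sequence."""
    return [value for row in heatmap for value in row]


# Hypothetical 8 x 6 heatmap with a joint at column 4, row 3.
heatmap = gaussian_heatmap(height=8, width=6, cx=4, cy=3)
peak = decode_peak(heatmap)

tokens = flatten(heatmap)
flat_index = tokens.index(max(tokens))

print("decoded joint (x, y):", peak)
print("index of the peak after flattening:", flat_index)
```

After row-major flattening, the pixel directly below the peak sits a full row width away in the sequence, so spatial adjacency in the 2D map is no longer explicit in the 1D layout. This is the kind of location-sensitive structure the abstract describes as being lost.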