Transformer architectures have achieved SOTA performance on human mesh recovery (HMR) from monocular images. However, this performance gain has come at the cost of substantial memory and computational overhead; a lightweight, efficient model that reconstructs an accurate human mesh is needed for real-world applications. In this paper, we propose a pure transformer architecture named POoling aTtention TransformER (POTTER) for the HMR task from single images. Observing that the conventional attention module is expensive in both memory and computation, we propose an efficient pooling attention module that significantly reduces this cost without sacrificing performance. Furthermore, we design a new transformer architecture that integrates a High-Resolution (HR) stream for the HMR task; the high-resolution local and global features from the HR stream can be utilized to recover a more accurate human mesh. Our POTTER outperforms the SOTA method METRO on Human3.6M (PA-MPJPE metric) and 3DPW (all three metrics) while requiring only 7% of its parameters and 14% of its Multiply-Accumulate Operations (MACs). The project webpage is https://zczcwh.github.io/potter_page.
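To make the "pooling attention" idea concrete, below is a minimal sketch of one common way attention cost is reduced by pooling: downsample the tokens used as keys/values before computing standard attention (in the spirit of MViT/PVTv2-style pooled attention). This is an illustration under that assumption, not POTTER's exact module; the class name `PooledAttention` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class PooledAttention(nn.Module):
    """Self-attention where keys/values are spatially pooled to cut cost.

    Illustrative only: POTTER's actual pooling attention module may differ.
    Pooling K/V with stride s shrinks attention cost from O(N^2) to
    roughly O(N * N / s^2) for N tokens.
    """
    def __init__(self, dim, num_heads=8, pool_stride=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Average-pool the token grid before forming K and V.
        self.pool = nn.AvgPool2d(pool_stride, pool_stride)

    def forward(self, x, h, w):
        # x: (B, N, C) tokens laid out on an h x w grid, N = h * w.
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Pool tokens spatially to reduce the K/V sequence length.
        x2d = x.transpose(1, 2).reshape(B, C, h, w)
        pooled = self.pool(x2d).flatten(2).transpose(1, 2)  # (B, N', C), N' < N
        k, v = self.kv(pooled).chunk(2, dim=-1)
        k = k.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)
        v = v.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Standard scaled dot-product attention over the shortened K/V.
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Quick shape check on a 16x16 token grid.
x = torch.randn(2, 256, 64)
print(PooledAttention(64)(x, 16, 16).shape)  # torch.Size([2, 256, 64])
```

With `pool_stride=2`, the key/value sequence shrinks 4x, so the attention matrix is 4x smaller in both memory and MACs, which matches the abstract's motivation for replacing conventional attention.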