The widespread deployment of 3D human pose estimation (HPE) is limited by the constrained resources of edge devices, which calls for more efficient models. A key approach to improving efficiency is to design networks around the structural characteristics of the input data. However, effectively exploiting the structural priors in human skeletal inputs remains challenging. To address this, we leverage both explicit and implicit spatio-temporal priors of the human body through model design and a pre-training proxy task. First, we propose the Nano Human Topology Network (NanoHTNet), a tiny 3D HPE network that stacks Hierarchical Mixers to capture explicit features. Specifically, the spatial Hierarchical Mixer efficiently learns the physical topology of the human body across multiple semantic levels, while the temporal Hierarchical Mixer, which uses the discrete cosine transform and low-pass filtering, captures both local instantaneous movements and global action coherence. Moreover, we introduce Efficient Temporal-Spatial Tokenization (ETST) to enhance spatio-temporal interaction and significantly reduce computational complexity. Second, we propose PoseCLR, a general pre-training method for 3D HPE based on contrastive learning, which aims to extract implicit representations of human topology. By aligning 2D poses from diverse viewpoints in the proxy task, PoseCLR helps 3D HPE encoders such as NanoHTNet capture the high-dimensional features of the human body more effectively, leading to further performance improvements. Extensive experiments verify that NanoHTNet with PoseCLR outperforms other state-of-the-art methods in efficiency, making it well suited to deployment on edge devices such as the Jetson Nano. Code and models are available at https://github.com/vefalun/NanoHTNet.
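To make the temporal idea concrete, the following is a minimal sketch of low-pass filtering in the discrete cosine transform (DCT) domain, applied to the trajectory of a single joint coordinate over time. It is an illustration under our own assumptions, not the NanoHTNet implementation: the function names `dct`, `idct`, and `low_pass`, the cutoff parameter `keep`, and the toy trajectory are all hypothetical, and the abstract does not specify how the filter is combined with the mixer.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1D sequence (one coordinate over T frames)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            s += scale * c[k] * math.cos(math.pi * (t + 0.5) * k / n)
        out.append(s)
    return out


def low_pass(x, keep):
    """Zero all but the `keep` lowest-frequency DCT coefficients (assumed cutoff)."""
    coeffs = dct(x)
    kept = [c if k < keep else 0.0 for k, c in enumerate(coeffs)]
    return idct(kept)


# Hypothetical toy trajectory: a slow upward trend plus frame-to-frame jitter.
trajectory = [0.0, 1.0, 0.5, 2.0, 1.5, 3.0, 2.5, 4.0]
smoothed = low_pass(trajectory, keep=3)
```

Keeping only the lowest-frequency coefficients retains the slowly varying component of the motion, which is one plausible reading of "global action coherence", while the discarded high-frequency coefficients carry the rapid frame-to-frame changes. A smaller `keep` gives a smoother sequence and a more compact temporal representation.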