Egocentric 3D human pose estimation (HPE) from images is challenging due to severe self-occlusions and strong distortion introduced by the fish-eye view from the head-mounted camera. Although existing works use intermediate heatmap-based representations to counter distortion with some success, addressing self-occlusion remains an open problem. In this work, we leverage information from past frames to guide our self-attention-based 3D HPE procedure -- Ego-STAN. Specifically, we build a spatio-temporal Transformer model that attends to semantically rich convolutional neural network-based feature maps. We also propose feature map tokens: a new set of learnable parameters that attend to these feature maps. Finally, we demonstrate Ego-STAN's superior performance on the xR-EgoPose dataset, where it achieves a 30.6% improvement in the overall mean per-joint position error, while reducing the parameter count by 22% compared to the state-of-the-art.
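The sketch below illustrates, in simplified form, the idea described above: a set of learnable "feature map tokens" is concatenated with flattened per-frame CNN feature maps from the past few frames, the whole sequence is processed by a Transformer encoder, and the attended tokens are read out to regress 3D joint positions. This is a minimal PyTorch illustration under assumed dimensions and module names (EgoSTANSketch, feat_dim, num_tokens, etc. are hypothetical), not the authors' exact Ego-STAN architecture.

```python
import torch
import torch.nn as nn


class EgoSTANSketch(nn.Module):
    """Minimal sketch: learnable feature-map tokens attending to spatio-temporal
    CNN feature maps via a Transformer encoder (dimensions are illustrative)."""

    def __init__(self, feat_dim=256, num_tokens=16, num_joints=16,
                 depth=3, heads=8, seq_len=5, grid=8):
        super().__init__()
        # Learnable feature-map tokens: they query the per-frame feature maps
        # through self-attention and are later read out for joint regression.
        self.fm_tokens = nn.Parameter(torch.zeros(1, num_tokens, feat_dim))
        nn.init.trunc_normal_(self.fm_tokens, std=0.02)
        # Positional embedding over the tokens plus the flattened (time x space) grid.
        self.pos_emb = nn.Parameter(
            torch.zeros(1, num_tokens + seq_len * grid * grid, feat_dim))
        nn.init.trunc_normal_(self.pos_emb, std=0.02)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Regress 3D joint coordinates from the attended tokens.
        self.head = nn.Linear(num_tokens * feat_dim, num_joints * 3)
        self.num_tokens = num_tokens
        self.num_joints = num_joints

    def forward(self, feat_maps):
        # feat_maps: (B, T, C, H, W) CNN feature maps from the past T frames.
        b, t, c, h, w = feat_maps.shape
        x = feat_maps.flatten(3).permute(0, 1, 3, 2).reshape(b, t * h * w, c)
        tokens = self.fm_tokens.expand(b, -1, -1)
        x = torch.cat([tokens, x], dim=1) + self.pos_emb
        x = self.encoder(x)
        out = x[:, : self.num_tokens].reshape(b, -1)
        return self.head(out).view(b, self.num_joints, 3)


# Usage example with dummy inputs (e.g. ResNet-style feature maps for 5 past frames):
model = EgoSTANSketch()
feats = torch.randn(2, 5, 256, 8, 8)
pose = model(feats)   # (2, 16, 3) estimated 3D joint positions
```

The key design choice mirrored here is that the pose read-out happens from the learnable tokens rather than from a fixed spatial location, letting self-attention aggregate evidence across both space and past frames, which is how temporal context can help with self-occlusion.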