The attention mechanism provides a sequential prediction framework for learning spatial models with enhanced implicit temporal consistency. In this work, we present a systematic design (from 2D to 3D) that shows how conventional networks and other forms of constraints can be incorporated into the attention framework to learn long-range dependencies for pose estimation. The contribution of this paper is a systematic approach to designing and training attention-based models for end-to-end pose estimation, with the flexibility and scalability to accept arbitrary video sequences as input. We achieve this by adapting the temporal receptive field via a multi-scale structure of dilated convolutions. Moreover, the proposed architecture can easily be adapted into a causal model, enabling real-time performance. Any off-the-shelf 2D pose estimation system, e.g. Mocap libraries, can be integrated in an ad-hoc fashion. Our method achieves state-of-the-art performance, outperforming existing methods by reducing the mean per-joint position error to 33.4 mm on the Human3.6M dataset.
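To illustrate how a multi-scale structure of dilated convolutions enlarges the temporal receptive field, the sketch below stacks causal dilated 1D convolutions with dilations 1, 3, and 9 (hypothetical values chosen for illustration; the paper's actual layer configuration is not specified here). With kernel size k and dilations d_i, the receptive field is 1 + (k-1)·Σd_i, so three layers with k=3 cover 1 + 2·(1+3+9) = 27 frames. Tracing a unit impulse through the stack confirms this, and the causal padding shows why the same design yields a real-time (no-lookahead) variant.

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation):
    """Causal dilated 1D convolution: output at frame t depends only on frames <= t."""
    k = len(w)
    pad = (k - 1) * dilation           # left-pad so no future frames are used
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

# Unit impulse at frame 0; count how many output frames it influences
# after stacking dilations 1, 3, 9 with kernel size 3.
x = np.zeros(40)
x[0] = 1.0
w = np.ones(3)
y = x
for d in (1, 3, 9):
    y = causal_dilated_conv1d(y, w, d)

print(int(np.count_nonzero(y)))  # → 27, the stacked receptive field in frames
```

Doubling the number of layers (or the dilation base) grows the receptive field geometrically while the parameter count grows only linearly, which is what makes this structure practical for video sequences of arbitrary length.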