We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale, heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy, partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion and transfer readily to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network, which captures long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, achieving the lowest 3D pose estimation error to date when trained from scratch. Furthermore, the proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a lightweight regression head (1-2 layers), demonstrating the versatility of the learned motion representations.
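To make the dual-stream idea concrete, below is a minimal PyTorch sketch of one block in the spirit of DSTformer: a spatial stream applies self-attention across joints within each frame, a temporal stream applies self-attention across frames for each joint, and a learned per-position gate fuses the two streams adaptively. All module names, dimensions, and the gating scheme here are illustrative assumptions, not the paper's exact implementation, which interleaves spatial and temporal attention within each stream.

```python
# A minimal sketch of a dual-stream spatio-temporal block (assumed design,
# not the authors' code). Input: per-joint features of shape
# (batch, frames, joints, channels).
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        # Predicts two logits per position for adaptive stream fusion.
        self.fuse = nn.Linear(2 * dim, 2)

    def forward(self, x):
        B, T, J, C = x.shape
        # Spatial stream: attention among the J joints of each frame.
        xs = x.reshape(B * T, J, C)
        ns = self.norm_s(xs)
        xs = xs + self.spatial_attn(ns, ns, ns)[0]
        xs = xs.reshape(B, T, J, C)
        # Temporal stream: attention among the T frames of each joint.
        xt = x.permute(0, 2, 1, 3).reshape(B * J, T, C)
        nt = self.norm_t(xt)
        xt = xt + self.temporal_attn(nt, nt, nt)[0]
        xt = xt.reshape(B, J, T, C).permute(0, 2, 1, 3)
        # Adaptive fusion: softmax weights over the two streams.
        w = torch.softmax(self.fuse(torch.cat([xs, xt], dim=-1)), dim=-1)
        return w[..., :1] * xs + w[..., 1:] * xt

# Usage: stacked blocks plus a small regression head would map embedded
# noisy 2D keypoints toward 3D motion, mirroring the pretraining task.
x = torch.randn(2, 16, 17, 256)   # (batch, frames, joints, channels)
print(DualStreamBlock()(x).shape)  # torch.Size([2, 16, 17, 256])
```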