Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos. Recent transformer-based approaches focus on capturing spatiotemporal information from sequences of 2D poses, which cannot model contextual depth effectively since visual depth cues are lost in the 2D pose estimation step. In this paper, we simplify the paradigm into an end-to-end framework, Instance-guided Video Transformer (IVT), which learns spatiotemporal contextual depth information from visual features effectively and predicts 3D poses directly from video frames. In particular, we first formulate video frames as a series of instance-guided tokens, where each token is in charge of predicting the 3D pose of one human instance. These tokens contain body structure information because they are extracted under the guidance of joint offsets from the human center to the corresponding body joints. These tokens are then sent into IVT to learn spatiotemporal contextual depth. In addition, we propose a cross-scale instance-guided attention mechanism to handle the scale variation among multiple persons. Finally, the 3D pose of each person is decoded from its instance-guided token by coordinate regression. Experiments on three widely used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performance.
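The instance-guided token extraction described above can be illustrated with a minimal sketch: given a visual feature map, a human-center location, and predicted offsets from that center to each body joint, features are sampled at the offset joint locations and pooled into one token per instance. The function name, shapes, and mean-pooling choice below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def instance_guided_token(feature_map, center, joint_offsets):
    """Pool features at joint locations (center + offset) into one instance token.

    feature_map:   (H, W, C) array of visual features from the backbone.
    center:        (2,) array, (y, x) coordinates of the human-instance center.
    joint_offsets: (J, 2) array of offsets from the center to each body joint.
    All names and shapes are illustrative, not the paper's exact API.
    """
    H, W, C = feature_map.shape
    joints = center[None, :] + joint_offsets           # (J, 2) absolute joint coords
    joints = np.clip(np.round(joints).astype(int),     # snap to grid, stay in-bounds
                     [0, 0], [H - 1, W - 1])
    sampled = feature_map[joints[:, 0], joints[:, 1]]  # (J, C) per-joint features
    return sampled.mean(axis=0)                        # (C,) token for this instance

# Toy example: an 8x8 feature map with 4 channels and 3 joints.
fm = np.random.rand(8, 8, 4)
token = instance_guided_token(fm, np.array([4.0, 4.0]),
                              np.array([[-1.0, 0.0], [0.0, 1.0], [2.0, -2.0]]))
print(token.shape)  # (4,)
```

One such token per detected person is then fed through the spatiotemporal transformer, and the 3D joint coordinates are regressed from the refined token.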