Multi-person pose understanding from RGB videos involves three complex tasks: pose estimation, tracking, and motion forecasting. Among these three tasks, pose estimation and tracking are correlated, and tracking is crucial to motion forecasting. Most existing works either focus on a single task or employ cascaded methods to solve each task separately. In this paper, we propose Snipper, a framework that performs multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single inference. Specifically, we first propose a deformable attention mechanism to aggregate spatiotemporal information from video snippets. Building upon this deformable attention, a visual transformer is learned to encode the spatiotemporal features from multi-frame images and to decode informative pose features that update multi-person pose queries. Finally, these queries are regressed to predict multi-person pose trajectories and future motions in one forward pass. In experiments, we show the effectiveness of Snipper on three challenging public datasets, where a single generic model rivals specialized state-of-the-art baselines for pose estimation, tracking, and forecasting. Code is available at https://github.com/JimmyZou/Snipper
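To make the deformable attention idea concrete, below is a minimal, hypothetical NumPy sketch of spatiotemporal deformable attention: a pose query predicts a small set of fractional sampling offsets per frame plus attention weights, features are bilinearly sampled at the offset locations across the snippet, and the weighted sum updates the query. All names (`deformable_attention`, `W_off`, `W_attn`, the number of sampling points `K`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly sample a (H, W, C) feature map at fractional (y, x)."""
    H, W, _ = feat.shape
    y = float(np.clip(y, 0, H - 1)); x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

def deformable_attention(query, feats, ref, W_off, W_attn, K=4):
    """Aggregate snippet features for one pose query (hypothetical sketch).

    query:  (C,)          pose-query embedding
    feats:  (T, H, W, C)  spatiotemporal feature maps from a T-frame snippet
    ref:    (y, x)        reference point of the query
    W_off:  (C, T*K*2)    projection predicting per-frame sampling offsets
    W_attn: (C, T*K)      projection predicting attention logits
    """
    T = feats.shape[0]
    offsets = (query @ W_off).reshape(T, K, 2)      # where to look per frame
    logits = (query @ W_attn).reshape(T * K)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax over all samples
    samples = []
    for t in range(T):
        for k in range(K):
            dy, dx = offsets[t, k]
            samples.append(bilinear_sample(feats[t], ref[0] + dy, ref[1] + dx))
    # Attention-weighted sum of the K*T sampled features updates the query.
    return (weights[:, None] * np.stack(samples)).sum(axis=0)
```

Because each query attends to only `K` learned locations per frame rather than the full feature grid, the cost stays linear in snippet length, which is what makes attending across multiple frames tractable.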