Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model. However, this strategy cannot capture the global spatio-temporal context among spatial instances. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and shape decoder (STSD) capture the global dependencies between pose queries and feature tokens and between shape queries and feature tokens, respectively. To handle the variation of objects over time, a novel progressive decoding scheme updates the pose and shape queries at each frame. In addition, we propose a novel pose-guided attention (PGA) mechanism for the shape decoder to better predict shape parameters. These two components strengthen the decoder of PSVT and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.
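The progressive decoding idea, in which queries are updated at each frame, can be illustrated with a minimal sketch. The abstract does not specify the update rule, so the sketch assumes single-head dot-product cross-attention with a residual update; the names `cross_attend` and `progressive_decode` and the toy inputs are hypothetical and are not taken from PSVT.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, tokens):
    """Assumed update rule: single-head dot-product cross-attention plus a residual."""
    dim = len(tokens[0])
    updated = []
    for q in queries:
        # Scaled dot-product score between this query and every feature token.
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        weights = softmax(scores)
        # Attention-weighted average of the tokens.
        context = [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        # Residual connection: the query is refined rather than replaced.
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated


def progressive_decode(initial_queries, frames):
    """Carry queries forward in time: the output at frame t seeds frame t + 1."""
    queries = initial_queries
    per_frame = []
    for tokens in frames:
        queries = cross_attend(queries, tokens)
        per_frame.append(queries)
    return per_frame


# Toy input: two frames, each with two 2-D feature tokens, and one query.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
outputs = progressive_decode([[0.0, 0.0]], frames)
```

Under this assumption, the query used at the second frame already carries information from the first, which is the sense in which the decoding is progressive. In the framework described above, separate pose and shape queries would each be carried forward in this way.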