This paper presents the Volumetric Transformer Pose estimator (VTP), the first 3D volumetric transformer framework for multi-view, multi-person 3D human pose estimation. VTP aggregates features from 2D keypoints across all camera views and directly learns spatial relationships in the 3D voxel space in an end-to-end fashion. The aggregated 3D features are passed through 3D convolutions before being flattened into sequential embeddings and fed into a transformer. A residual structure is designed to further improve performance. In addition, sparse Sinkhorn attention is employed to reduce the memory cost, a major bottleneck of volumetric representations, while still achieving excellent performance. The output of the transformer is again combined with the 3D convolutional features through a residual design. The proposed VTP framework integrates the high performance of transformers with volumetric representations and can serve as a strong alternative to convolutional backbones. Experiments on the Shelf, Campus, and CMU Panoptic benchmarks show promising results in terms of both Mean Per Joint Position Error (MPJPE) and Percentage of Correctly estimated Parts (PCP). Our code will be made available.
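To make the described pipeline concrete, the following is a minimal sketch of a VTP-style forward pass: 3D convolutions over the aggregated volumetric features, flattening of voxels into a token sequence processed by a transformer, and a residual fusion of the transformer output with the convolutional features. All module names, tensor shapes, and hyper-parameters are illustrative assumptions rather than the authors' implementation, and a standard transformer encoder stands in for the paper's sparse Sinkhorn attention.

```python
# Hypothetical sketch of a VTP-style volumetric transformer (not the official code).
import torch
import torch.nn as nn

class VTPSketch(nn.Module):
    def __init__(self, in_ch=32, embed_dim=128, depth=4, heads=4):
        super().__init__()
        # 3D convolutions over the aggregated voxel features from all camera views
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_ch, embed_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(embed_dim),
            nn.ReLU(inplace=True),
        )
        # Transformer over flattened voxel embeddings; a dense encoder is used here
        # as a stand-in for the sparse Sinkhorn attention described in the abstract.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=256, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # Head producing a per-voxel likelihood volume (single channel for illustration)
        self.head = nn.Conv3d(embed_dim, 1, kernel_size=1)

    def forward(self, vol):                        # vol: (B, C, D, H, W)
        feat = self.conv3d(vol)                    # (B, E, D, H, W)
        B, E, D, H, W = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, D*H*W, E) sequential embeddings
        out = self.transformer(tokens)             # spatial relationships in voxel space
        out = out.transpose(1, 2).reshape(B, E, D, H, W)
        fused = out + feat                         # simplified residual fusion of transformer
        return self.head(fused)                    # and conv features; (B, 1, D, H, W)

if __name__ == "__main__":
    x = torch.randn(2, 32, 8, 8, 8)                # toy volume of fused 2D-keypoint features
    print(VTPSketch()(x).shape)                    # torch.Size([2, 1, 8, 8, 8])
```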