Vision transformer (ViT) architectures have proven highly effective for image classification. Efforts to solve more challenging vision tasks with transformers, however, rely on convolutional backbones for feature extraction. In this paper, we investigate the use of a pure transformer architecture (i.e., one with no CNN backbone) for 2D body pose estimation. We evaluate two ViT architectures on the COCO dataset and demonstrate that an encoder-decoder transformer architecture yields state-of-the-art results on this task.