ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

From arXiv; tech report. 81.1 mAP on the MS COCO Keypoint Detection test-dev set. V2: updated multi-task training results: 92.8 AP on OCHuman, 78.3 AP on CrowdPose, 94.3 PCKh on MPII, and 43.2 AP on AI Challenger.

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and fine-tuning strategies, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art. The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.
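
To make the backbone-plus-decoder pipeline described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' code, and it rests on assumptions the abstract does not state: that the lightweight decoder outputs one heatmap per keypoint (the usual top-down formulation), and that the crop size, patch size, and stride used below are merely illustrative. The helper names `token_grid` and `decode_keypoints` are hypothetical.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # one H x W score map per keypoint (assumed output format)


def token_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Rows and columns of patch tokens a plain, non-hierarchical ViT
    produces for a single person crop (hypothetical helper)."""
    if height % patch or width % patch:
        raise ValueError("crop size must be divisible by the patch size")
    return height // patch, width // patch


def decode_keypoints(
    heatmaps: List[Heatmap], stride: int
) -> List[Tuple[int, int, float]]:
    """Pick the highest-scoring cell of each heatmap and scale it back
    to crop pixels, returning (x, y, score) per keypoint."""
    keypoints = []
    for hm in heatmaps:
        best_score, best_x, best_y = max(
            (score, x, y)
            for y, row in enumerate(hm)
            for x, score in enumerate(row)
        )
        keypoints.append((best_x * stride, best_y * stride, best_score))
    return keypoints


# Illustrative values only: a 256 x 192 crop with 16 x 16 patches.
rows, cols = token_grid(256, 192, 16)

# A toy 2 x 2 heatmap for a single keypoint, decoded with stride 4.
toy = [[[0.0, 0.1], [0.9, 0.2]]]
print(rows, cols, decode_keypoints(toy, stride=4))
```

Because the backbone is non-hierarchical, the token grid keeps the same shape through every transformer layer, so a decoder of this kind only needs to upsample that grid to heatmap resolution before the arg-max step.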
