Recently, customized vision transformers have been adapted for human pose estimation and have achieved superior performance with elaborate structures. However, it is still unclear whether plain vision transformers can facilitate pose estimation. In this paper, we take the first step toward answering this question by employing a plain, non-hierarchical vision transformer together with simple deconvolution decoders, termed ViTPose, for human pose estimation. We demonstrate that a plain vision transformer with MAE pretraining can obtain superior performance after finetuning on human pose estimation datasets. ViTPose has good scalability with respect to model size and flexibility regarding input resolution and token number. Moreover, it can be easily pretrained on unlabeled pose data without the need for large-scale upstream ImageNet data. Our biggest ViTPose model, based on the ViTAE-G backbone with 1 billion parameters, obtains 80.9 mAP on the MS COCO test-dev set, while the ensemble models further set a new state of the art for human pose estimation, i.e., 81.1 mAP. The source code and models will be released at https://github.com/ViTAE-Transformer/ViTPose.
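To make the described design concrete, below is a minimal, self-contained PyTorch sketch of the idea: a plain, non-hierarchical ViT encoder whose output tokens are reshaped into a 2D feature map and decoded by two deconvolution blocks plus a 1x1 convolution into keypoint heatmaps. All hyper-parameters (depth, embedding dimension, input resolution, 17 COCO keypoints) and the class names are illustrative assumptions, not the paper's exact configuration or released code.

```python
# Minimal sketch (assumed hyper-parameters): plain ViT encoder + simple
# deconvolution decoder predicting keypoint heatmaps, in the spirit of ViTPose.
import torch
import torch.nn as nn

class PlainViTEncoder(nn.Module):
    def __init__(self, img_size=(256, 192), patch_size=16,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.grid = (img_size[0] // patch_size, img_size[1] // patch_size)
        self.patch_embed = nn.Conv2d(3, embed_dim, patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.grid[0] * self.grid[1], embed_dim))
        layer = nn.TransformerEncoderLayer(
            embed_dim, num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.embed_dim = embed_dim

    def forward(self, x):
        x = self.patch_embed(x)                      # (B, C, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)             # (B, N, C) patch tokens
        x = self.blocks(x + self.pos_embed)          # plain, non-hierarchical
        B, N, C = x.shape
        return x.transpose(1, 2).reshape(B, C, *self.grid)  # back to a 2D map

class DeconvDecoder(nn.Module):
    """Two deconv blocks (each 2x upsampling) followed by a 1x1 conv head."""
    def __init__(self, in_dim, num_keypoints=17, hidden=256):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_dim, hidden, 4, stride=2, padding=1),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(hidden, hidden, 4, stride=2, padding=1),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(hidden, num_keypoints, 1)

    def forward(self, feat):
        return self.head(self.up(feat))              # (B, K, H/4, W/4)

class ViTPoseSketch(nn.Module):                      # hypothetical name
    def __init__(self):
        super().__init__()
        self.backbone = PlainViTEncoder()
        self.decoder = DeconvDecoder(self.backbone.embed_dim)

    def forward(self, img):
        return self.decoder(self.backbone(img))

heatmaps = ViTPoseSketch()(torch.randn(1, 3, 256, 192))
print(heatmaps.shape)  # torch.Size([1, 17, 64, 48])
```

The sketch reflects the paper's overall recipe (a generic transformer backbone with a lightweight heatmap decoder); the released models additionally rely on MAE pretraining of the backbone before finetuning on pose datasets.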