Recently, neural networks based purely on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the benefit of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
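To make the distillation-token idea concrete, here is a minimal sketch, not the authors' implementation: a learnable distillation token is appended to the patch sequence alongside the class token, both tokens interact with the patches through the transformer's attention layers, and a separate head on the distillation token is trained against the teacher's predictions (hard-label variant). All names (`DistilledViT`, `distillation_loss`), dimensions, and the fixed 0.5/0.5 loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledViT(nn.Module):
    """Vision transformer with an extra distillation token (illustrative sketch)."""
    def __init__(self, num_patches=196, embed_dim=768, depth=12, num_heads=12,
                 num_classes=1000):
        super().__init__()
        # Project flattened 16x16 RGB patches to the embedding dimension.
        self.patch_proj = nn.Linear(16 * 16 * 3, embed_dim)
        # Learnable class token and distillation token, prepended to the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Two heads: one reads the class token, one reads the distillation token.
        self.head_cls = nn.Linear(embed_dim, num_classes)
        self.head_dist = nn.Linear(embed_dim, num_classes)

    def forward(self, patches):  # patches: (B, num_patches, 16*16*3)
        x = self.patch_proj(patches)
        B = x.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        dist = self.dist_token.expand(B, -1, -1)
        # Both tokens attend to the patches (and to each other) through self-attention.
        x = torch.cat([cls, dist, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head_cls(x[:, 0]), self.head_dist(x[:, 1])

def distillation_loss(logits_cls, logits_dist, teacher_logits, labels):
    # Class head is supervised by the true labels; distillation head is
    # supervised by the teacher's predicted (hard) labels.
    loss_cls = F.cross_entropy(logits_cls, labels)
    loss_dist = F.cross_entropy(logits_dist, teacher_logits.argmax(dim=-1))
    return 0.5 * loss_cls + 0.5 * loss_dist
```

In this sketch the teacher (e.g., a convnet) is run separately to produce `teacher_logits`; at test time the two heads' predictions can simply be averaged.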