Attention-based neural networks such as the Vision Transformer (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; therefore, understanding a model's scaling properties is key to designing future generations effectively. While scaling laws for Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing the accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy. The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
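As an illustration of the kind of relationship the abstract refers to, scaling-law studies typically fit a saturating power law relating error to compute (or data); the following form and symbols are an assumed sketch for clarity, not the paper's exact parameterization: $E(C) = a\,(C + d)^{-b} + c$, where $E$ is the error rate, $C$ is compute, $c$ is an irreducible error floor, and $a$, $b$, $d$ are fitted constants.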