The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100B parameters. Vision Transformers (ViT) have introduced the same architecture to image and video modelling, but these have not yet been successfully scaled to nearly the same degree; the largest dense ViT contains 4B parameters (Chen et al., 2022). We present a recipe for highly efficient and stable training of a 22B-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision, and provides key steps towards getting there.
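The "lightweight linear model on frozen features" evaluation mentioned above is commonly known as linear probing: the pretrained backbone is kept fixed and only a linear classifier is trained on its embeddings. The sketch below illustrates the general idea only; it is not the paper's implementation. The `extract_features` function is a hypothetical stand-in for the frozen ViT backbone, and the data here are randomly generated for self-containedness.

```python
# Minimal sketch of linear probing on frozen features (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def extract_features(images: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder: a real pipeline would run the frozen ViT
    # backbone and return one embedding per image (e.g. a pooled token).
    return rng.normal(size=(images.shape[0], 1024))

# Simulated dataset: 512 "images" with labels from 10 classes.
images = rng.normal(size=(512, 224, 224, 3))
labels = rng.integers(0, 10, size=512)

feats = extract_features(images)           # backbone weights stay frozen
probe = LogisticRegression(max_iter=1000)  # lightweight linear head
probe.fit(feats[:400], labels[:400])       # only the probe is trained
print("probe accuracy:", probe.score(feats[400:], labels[400:]))
```

Because only the linear head is optimized, this evaluation isolates the quality of the frozen representations themselves, which is why it is used to measure how downstream performance changes with model scale.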