We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536$\times$1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1/54.4 box/mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. Our techniques are generally applicable for scaling up vision models, which has not been as widely explored as the scaling of NLP language models, partly due to the following difficulties in training and application: 1) vision models often face instability issues at scale, and 2) many downstream vision tasks require high-resolution images or windows, and it is not clear how to effectively transfer models pre-trained at low resolutions to higher-resolution counterparts. GPU memory consumption is also a problem when the image resolution is high. To address these issues, we present several techniques, illustrated using Swin Transformer as a case study: 1) a post normalization technique and a scaled cosine attention approach to improve the stability of large vision models; 2) a log-spaced continuous position bias technique to effectively transfer models pre-trained at low-resolution images and windows to their higher-resolution counterparts. In addition, we share crucial implementation details that lead to significant savings in GPU memory consumption, making it feasible to train large vision models with regular GPUs. Using these techniques and self-supervised pre-training, we successfully train a strong 3-billion-parameter Swin Transformer model and effectively transfer it to various vision tasks involving high-resolution images or windows, achieving state-of-the-art accuracy on a variety of benchmarks.
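To make the named techniques concrete, below is a minimal PyTorch sketch, assuming a single window of tokens: scaled cosine attention (query/key cosine similarity scaled by a learnable, clamped temperature $\tau$), a log-spaced continuous position bias produced by a small meta-network, and a residual-post-norm block. The class names, the 512-wide meta-network, and the initialization choices are illustrative assumptions for this sketch, not the official implementation.

```python
# Hedged sketch of scaled cosine attention + log-spaced continuous position
# bias (log-CPB) + residual post-norm. Names and hyperparameters here are
# illustrative assumptions, not the paper's official code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        # Learnable per-head temperature: attention logits are multiplied by
        # exp(logit_scale), clamped so the effective 1/tau never exceeds 100
        # (i.e., tau stays above 0.01, as described in the paper).
        self.logit_scale = nn.Parameter(torch.log(10.0 * torch.ones(num_heads, 1, 1)))
        # Small meta-network G mapping log-spaced relative coordinates to a
        # per-head bias (width 512 is an assumption of this sketch).
        self.cpb_mlp = nn.Sequential(
            nn.Linear(2, 512), nn.ReLU(inplace=True), nn.Linear(512, num_heads))
        # Precompute log-spaced relative coordinates for all token pairs:
        # delta_hat = sign(delta) * log2(1 + |delta|).
        ws_h, ws_w = window_size
        coords = torch.stack(torch.meshgrid(
            torch.arange(ws_h), torch.arange(ws_w), indexing="ij"))   # 2, Wh, Ww
        coords = coords.flatten(1)                                    # 2, N
        rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0).float()  # N, N, 2
        rel = torch.sign(rel) * torch.log2(1.0 + rel.abs())
        self.register_buffer("rel_coords_log", rel)

    def forward(self, x):
        # x: (B, N, C) with N == window height * window width.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Cosine similarity between queries and keys, scaled by 1/tau.
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        scale = torch.clamp(self.logit_scale, max=torch.log(torch.tensor(100.0))).exp()
        attn = attn * scale
        # Continuous position bias from the meta-network.
        bias = self.cpb_mlp(self.rel_coords_log).permute(2, 0, 1)     # heads, N, N
        attn = attn + bias.unsqueeze(0)
        return (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)

class ResPostNormBlock(nn.Module):
    """Residual post-norm: LayerNorm is applied to the sub-module output
    before the residual add, instead of to its input as in pre-norm."""
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.attn = ScaledCosineAttention(dim, num_heads, window_size)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.norm(self.attn(x))

# Usage: a batch of 2 windows of 7x7 = 49 tokens, embedding dim 96.
blk = ResPostNormBlock(dim=96, num_heads=3, window_size=(7, 7))
out = blk(torch.randn(2, 49, 96))
```

The intent of each piece follows the abstract: cosine similarity with a clamped $\tau$ bounds attention logits regardless of activation amplitude (stability at scale), post-norm keeps residual amplitudes from accumulating across deep layers, and log-spacing the relative coordinates shrinks the extrapolation range the meta-network must cover when a model pre-trained with small windows is transferred to larger ones.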