We present a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, namely ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 84.3% average top-1 accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Finally, our batch size is 65536, which is 2x larger than CLIP's and 4x larger than ALIGN's. We encountered two main challenges with the combined scaling rules of BASIC. First, the limited memory of accelerators, such as GPUs and TPUs, constrains how far the training can be scaled; to overcome this limit, we propose two simple methods that make use of gradient checkpointing and model parallelism. Second, while increasing the dataset size and the model size has been the de facto method for improving the performance of deep learning models like BASIC, the effect of a large contrastive batch size on such contrastive-trained image-text models is not well understood. To shed light on the benefits of large contrastive batch sizes, we develop a theoretical framework which shows that larger contrastive batch sizes lead to smaller generalization gaps for image-text models such as BASIC.
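To make the batch-size discussion concrete, below is a minimal sketch (not the authors' code) of a CLIP/ALIGN-style bidirectional image-text contrastive loss in JAX. The embedding dimension, temperature value, and function names are illustrative assumptions; the point is that the loss builds a [batch, batch] similarity matrix, so every other example in the batch acts as a negative, and memory grows quadratically with the contrastive batch size (at a batch of 65536, the logit matrix alone is roughly 17 GB in float32), which is why gradient checkpointing and model parallelism become necessary.

```python
# Illustrative sketch of a CLIP/ALIGN-style contrastive loss; shapes,
# temperature, and names are assumptions, not the BASIC implementation.
import jax
import jax.numpy as jnp


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Bidirectional image-text contrastive loss over one batch.

    image_emb, text_emb: [batch, dim] L2-normalized embeddings of paired
    images and texts. Matched pairs sit on the diagonal of the logit matrix;
    all off-diagonal entries serve as negatives.
    """
    # [batch, batch] similarity matrix -- this is the term whose memory
    # footprint scales with the square of the contrastive batch size.
    logits = image_emb @ text_emb.T / temperature

    # Image-to-text and text-to-image cross-entropy with diagonal targets.
    loss_i2t = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits, axis=-1)))
    loss_t2i = -jnp.mean(jnp.diag(jax.nn.log_softmax(logits.T, axis=-1)))
    return 0.5 * (loss_i2t + loss_t2i)


# Small usage example with a toy batch (a real run would shard a much
# larger global batch across accelerators).
key_i, key_t = jax.random.split(jax.random.PRNGKey(0))
img = jax.random.normal(key_i, (8, 512))
txt = jax.random.normal(key_t, (8, 512))
img = img / jnp.linalg.norm(img, axis=-1, keepdims=True)
txt = txt / jnp.linalg.norm(txt, axis=-1, keepdims=True)
print(contrastive_loss(img, txt))
```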