We present a combined scaling method called BASIC that achieves 85.7% top-1 zero-shot accuracy on the ImageNet ILSVRC-2012 validation set, surpassing the best published zero-shot models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, namely ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 83.7% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, which is 4x larger than ALIGN and 16x larger than CLIP. Our largest model has 3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than ALIGN and CLIP. Our batch size is 65536, which is 2x larger than CLIP and 4x larger than ALIGN. The main challenge with this scaling is the limited memory of accelerators such as GPUs and TPUs. We hence propose a simple method of online gradient caching to overcome this limit.
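To illustrate the idea behind gradient caching for a contrastive objective, below is a minimal PyTorch-style sketch. It is not the authors' implementation: the function names (clip_loss, gradient_cached_step), the micro-batch size, the temperature, and the assumption that images and tokenized texts arrive as tensors are all illustrative. The sketch shows the general two-pass scheme: embed micro-batches without building a graph, compute the full-batch contrastive loss and its gradients with respect to the cached embeddings, then re-encode each micro-batch with gradients enabled and backpropagate through the cached embedding gradients, so memory no longer grows with the contrastive batch size.

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE loss over L2-normalized embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def gradient_cached_step(image_encoder, text_encoder, images, texts,
                         optimizer, micro_batch_size=256):
    # Split the large contrastive batch into micro-batches that fit in memory.
    image_chunks = images.split(micro_batch_size)
    text_chunks = texts.split(micro_batch_size)

    # Pass 1: embed every micro-batch without storing activations.
    with torch.no_grad():
        img_embs = [image_encoder(x) for x in image_chunks]
        txt_embs = [text_encoder(t) for t in text_chunks]
    img_cache = torch.cat(img_embs).requires_grad_(True)
    txt_cache = torch.cat(txt_embs).requires_grad_(True)

    # Full-batch loss; only the small embedding tensors are kept in memory.
    loss = clip_loss(img_cache, txt_cache)
    loss.backward()  # fills img_cache.grad and txt_cache.grad

    # Pass 2: re-encode each micro-batch with gradients enabled and inject
    # the cached embedding gradients into the encoder backward passes.
    optimizer.zero_grad()
    img_grads = img_cache.grad.split(micro_batch_size)
    txt_grads = txt_cache.grad.split(micro_batch_size)
    for x, g in zip(image_chunks, img_grads):
        image_encoder(x).backward(gradient=g)
    for t, g in zip(text_chunks, txt_grads):
        text_encoder(t).backward(gradient=g)
    optimizer.step()
    return loss.detach()

Because the loss is computed once over all cached embeddings, the resulting parameter gradients match those of a single large-batch step, while peak activation memory is set by the micro-batch size rather than the contrastive batch size.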