We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up of the batch size, while also performing better at smaller batch sizes. With only four TPUv4 chips, we can train a Base CLIP model at a 4k batch size and a Large LiT model at a 20k batch size, the latter achieving 84.5% ImageNet zero-shot accuracy in two days. This disentanglement of the batch size from the loss further allows us to study the impact of examples vs. pairs and of the negative-to-positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing the batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
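To make the contrast with softmax-normalized losses concrete, here is a minimal sketch of a pairwise sigmoid loss of the kind described above, written in JAX. It is not the released implementation; the function name `sigmoid_loss`, the argument layout, and the use of plain scalars for the learnable temperature `t` and bias `b` are illustrative assumptions. The key point it demonstrates is that every image-text pair contributes an independent binary term, so no row- or column-wise normalization over the full similarity matrix is needed.

```python
# Minimal sketch (not the authors' released code) of a pairwise sigmoid loss
# for image-text pre-training. Assumes `img_emb` and `txt_emb` are
# L2-normalized embeddings of shape (batch, dim); `t` and `b` are learnable
# scalar temperature and bias.
import jax
import jax.numpy as jnp


def sigmoid_loss(img_emb, txt_emb, t, b):
    # Pairwise similarity logits between every image and every text in the batch.
    logits = t * img_emb @ txt_emb.T + b            # shape (batch, batch)
    # +1 on the diagonal (matching pairs), -1 everywhere else (non-matching pairs).
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0
    # Each entry is an independent binary classification term; unlike a softmax
    # contrastive loss, no normalization across the row or column is performed.
    return -jnp.mean(jnp.sum(jax.nn.log_sigmoid(labels * logits), axis=-1))
```

Because the loss decomposes over individual pairs, sharding the batch across devices does not require gathering the full similarity matrix, which is what allows the batch size to be scaled up (or down) independently of the loss formulation.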