We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of the contrastive loss when training contrastive learning models. Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all_reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of contrastive loss computation from $\mathcal{O}(B^2)$ to $\mathcal{O}(\frac{B^2}{N})$, where $B$ and $N$ are the batch size and the number of GPUs used for training, respectively. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, respectively, compared with the original CLIP solution, which requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code will be released at https://github.com/IDEA-Research/DisCo-CLIP.
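The approach can be illustrated with a custom autograd function that all-gathers features in the forward pass and all-reduces gradients in the backward pass, so that each GPU only computes its own block of the similarity matrix. Below is a minimal PyTorch sketch of this idea, not the authors' released implementation; the names `AllGatherWithGrad` and `disco_contrastive_loss`, and the loss/gradient scaling conventions, are illustrative assumptions.

```python
# Minimal sketch (not the released DisCo-CLIP code) of the memory-efficient
# distributed contrastive loss described above, using torch.distributed primitives.
import torch
import torch.distributed as dist
import torch.nn.functional as F


class AllGatherWithGrad(torch.autograd.Function):
    """All-gather features in the forward pass; in the backward pass, all_reduce
    the gradients so each GPU receives the inter-GPU gradient contributions
    computed by the other GPUs, then keep only the slice of its local batch."""

    @staticmethod
    def forward(ctx, local_feat):
        gathered = [torch.zeros_like(local_feat) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_feat)
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        # Sum the gradients of every GPU's local loss w.r.t. the gathered features,
        # i.e. collect the inter-GPU gradients instead of recomputing them locally.
        grad = grad_output.contiguous()
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        rank = dist.get_rank()
        b = grad.shape[0] // dist.get_world_size()
        return grad[rank * b:(rank + 1) * b]


def disco_contrastive_loss(img_feat, txt_feat, temperature=0.07):
    """Each GPU materializes only a (B/N) x B block of the similarity matrix,
    so the memory for the logits drops from O(B^2) to O(B^2 / N)."""
    all_img = AllGatherWithGrad.apply(img_feat)  # (B, d)
    all_txt = AllGatherWithGrad.apply(txt_feat)  # (B, d)
    local_b = img_feat.shape[0]
    rank = dist.get_rank()
    # Local rows against all columns: (B/N, B) logits instead of (B, B).
    logits_i2t = img_feat @ all_txt.t() / temperature
    logits_t2i = txt_feat @ all_img.t() / temperature
    labels = torch.arange(local_b, device=img_feat.device) + rank * local_b
    # Depending on how DDP averages parameter gradients, an extra 1/N scaling
    # of this loss (or of the all_reduced gradients) may be required.
    return 0.5 * (F.cross_entropy(logits_i2t, labels) + F.cross_entropy(logits_t2i, labels))
```

The sketch only shows where the all_reduce enters the backward pass; the exact gradient decomposition and its equivalence to the non-distributed loss are established in the paper.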