Contrastive learning has recently achieved state-of-the-art performance on a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training, but these approaches are inefficient: they increase epoch length in proportion to the number of mined negatives and require frequent updates of nearest-neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining, Global Contrastive Batch Sampling (GCBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$, in contrastive learning settings. Through experimentation we find that GCBS improves state-of-the-art performance on sentence-embedding and code-search tasks. Additionally, GCBS is easy to implement, requiring only a few additional lines of code; it maintains no external data structures such as nearest-neighbor indices, is more computationally efficient than even the most minimal hard negative mining approaches, and makes no changes to the model being trained.
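Since this abstract does not spell out the GCBS algorithm itself, the following is only a minimal illustrative sketch of the underlying idea: periodically reorder the dataset so that examples assigned to the same batch are mutually similar, making in-batch negatives harder and narrowing the gap between the in-batch and global contrastive losses. The greedy heuristic, the function name `greedy_batch_assignment`, and its arguments are hypothetical and are not the paper's method.

```python
# Illustrative sketch only (hypothetical, not the GCBS algorithm):
# assign examples to batches so that each batch contains mutually similar
# examples, based on cached, L2-normalized embeddings.
import numpy as np


def greedy_batch_assignment(embeddings: np.ndarray, batch_size: int) -> list[list[int]]:
    """Greedily group example indices into batches of mutually similar examples.

    embeddings: (n, d) array of cached example embeddings, assumed L2-normalized.
    Returns a list of index lists, one per batch.
    """
    n = embeddings.shape[0]
    sims = embeddings @ embeddings.T        # cosine similarities
    np.fill_diagonal(sims, -np.inf)         # ignore self-similarity
    unassigned = set(range(n))
    batches = []
    while unassigned:
        seed = next(iter(unassigned))       # start a new batch from any remaining example
        unassigned.discard(seed)
        batch = [seed]
        while len(batch) < batch_size and unassigned:
            # pick the remaining example most similar (on average) to the current batch
            cand = np.array(sorted(unassigned))
            scores = sims[np.ix_(cand, batch)].mean(axis=1)
            pick = int(cand[int(np.argmax(scores))])
            unassigned.discard(pick)
            batch.append(pick)
        batches.append(batch)
    return batches


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(64, 16))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    for b in greedy_batch_assignment(emb, batch_size=8)[:2]:
        print(b)
```

In this sketch the assignment is recomputed from cached embeddings (e.g., once per epoch), so unlike hard negative mining it does not lengthen the epoch or require an external nearest-neighbor index.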