Contrastive learning has gradually been applied to learn high-quality unsupervised sentence embeddings. Among previous unsupervised methods, the latest state-of-the-art method, as far as we know, is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses the InfoNCE loss function in the training stage by pulling semantically similar sentences together and pushing apart dissimilar ones. Theoretically, we expect to use larger batches in unsup-SimCSE to obtain more adequate comparisons among samples and avoid overfitting. However, increasing the batch size does not always lead to improvements, and can even lead to performance degradation when the batch size exceeds a threshold. Through statistical observation, we find that this is probably due to the introduction of low-confidence negative pairs after increasing the batch size. To alleviate this problem, we introduce a simple smoothing strategy upon the InfoNCE loss function, termed Gaussian Smoothing InfoNCE (GS-InfoNCE). Specifically, we add random Gaussian noise vectors as negative samples, which act as a smoothing of the negative sample space. Though simple, the proposed smoothing strategy brings substantial improvements to unsup-SimCSE. We evaluate GS-InfoNCE on the standard semantic text similarity (STS) task. GS-InfoNCE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 1.38%, 0.72%, 1.17% and 0.28% on the base of BERT-base, BERT-large, RoBERTa-base and RoBERTa-large, respectively.
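To make the smoothing idea concrete, below is a minimal PyTorch-style sketch of an InfoNCE loss extended with random Gaussian noise vectors as extra negatives. The function name, hyperparameter values (`num_noise`, `noise_std`, `tau`), and the exact way the noise is drawn are illustrative assumptions, not the paper's reported implementation or settings.

```python
import torch
import torch.nn.functional as F

def gs_infonce_loss(h1, h2, tau=0.05, num_noise=128, noise_std=1.0):
    """Sketch of an InfoNCE loss smoothed with Gaussian-noise negatives.

    h1, h2: [batch, dim] embeddings of two dropout-augmented views of the
            same sentences, as produced by unsup-SimCSE.
    num_noise, noise_std, tau: hypothetical hyperparameters for illustration.
    """
    batch, dim = h1.shape

    # Cosine similarities between all in-batch pairs: [batch, batch].
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau

    # Random Gaussian noise vectors act as additional negatives that
    # smooth the negative-sample space.
    noise = torch.randn(num_noise, dim, device=h1.device) * noise_std
    noise_sim = F.cosine_similarity(h1.unsqueeze(1), noise.unsqueeze(0), dim=-1) / tau

    # Append the noise similarities to the denominator logits; positives
    # remain on the diagonal of the in-batch similarity matrix.
    logits = torch.cat([sim, noise_sim], dim=1)      # [batch, batch + num_noise]
    labels = torch.arange(batch, device=h1.device)
    return F.cross_entropy(logits, labels)
```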