Following SimCSE, contrastive-learning-based methods have achieved state-of-the-art (SOTA) performance in learning sentence embeddings. However, unsupervised contrastive learning methods still lag far behind their supervised counterparts. We attribute this gap to the quality of positive and negative samples, and we aim to improve both. Specifically, for positive samples, we propose switch-case augmentation, which flips the case of the first letter of randomly selected words in a sentence. This counteracts the intrinsic bias of pre-trained token embeddings toward word frequency, letter case, and subword segmentation. For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model. Combining these two methods with SimCSE, our proposed Contrastive learning with Augmented and Retrieved Data for Sentence embedding (CARDS) method significantly surpasses the current SOTA on STS benchmarks in the unsupervised setting.
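As a rough illustration of the switch-case augmentation described above, the sketch below flips the case of the first letter of randomly chosen words; the selection probability and whitespace tokenization are illustrative assumptions, not the paper's exact implementation.

```python
import random

def switch_case_augment(sentence: str, p: float = 0.15, seed=None) -> str:
    """Flip the case of the first letter of randomly selected words.

    A minimal sketch assuming whitespace tokenization and a per-word
    flip probability `p`; both are placeholders for the actual setup.
    """
    rng = random.Random(seed)
    augmented = []
    for word in sentence.split():
        if word and word[0].isalpha() and rng.random() < p:
            # Swap upper <-> lower case on the first character only.
            word = word[0].swapcase() + word[1:]
        augmented.append(word)
    return " ".join(augmented)

# Example: produce a case-flipped positive view of a sentence.
print(switch_case_augment("The quick brown fox jumps over the lazy dog", p=0.3, seed=0))
```

In a contrastive setup such as SimCSE, the augmented sentence would serve as an additional positive view of the original sentence during training.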