Contrastive learning has been extensively studied for sentence embedding learning, under the assumption that the embeddings of different views of the same sentence should be closer to each other than to those of other sentences. The constraint imposed by this assumption is weak, however, and a good sentence representation should also be able to reconstruct the original sentence fragments. This paper therefore proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE. InfoCSE forces the representation at the [CLS] position to aggregate denser sentence information by introducing an additional masked language model task and a well-designed auxiliary network. We evaluate the proposed InfoCSE on several benchmark datasets for the semantic textual similarity (STS) task. Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base and 1.77% on BERT-large, achieving state-of-the-art results among unsupervised sentence representation learning methods. Our code is available at https://github.com/caskcsg/sentemb/tree/main/InfoCSE.
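The abstract describes the framework only at a high level. As a rough illustration of the training objective it sketches, the snippet below combines a standard in-batch InfoNCE contrastive loss over two dropout views with an auxiliary masked-language-modeling head that predicts masked tokens conditioned on the [CLS] sentence embedding, so that the MLM gradient pushes sentence-level information into [CLS]. The fusion design, the helper names (`AuxiliaryMLMHead`, `joint_loss`), and the loss weight `lam` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def infonce_loss(z1, z2, temperature=0.05):
    """In-batch InfoNCE loss between two dropout views of the same batch.
    z1, z2: (batch, dim) sentence embeddings; positives sit on the diagonal."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    sim = z1 @ z2.t() / temperature                      # (batch, batch) similarities
    labels = torch.arange(z1.size(0), device=z1.device)  # i-th row matches i-th column
    return F.cross_entropy(sim, labels)

class AuxiliaryMLMHead(nn.Module):
    """Hypothetical auxiliary network: fuses per-token states with the [CLS]
    embedding before decoding masked tokens, so reconstruction must rely on
    the sentence representation (the paper's actual design may differ)."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_states, cls_embedding):
        # Broadcast the [CLS] embedding to every position and fuse with tokens.
        cls_exp = cls_embedding.unsqueeze(1).expand_as(token_states)
        fused = torch.tanh(self.fuse(torch.cat([token_states, cls_exp], dim=-1)))
        return self.decoder(fused)                       # (batch, seq_len, vocab)

def joint_loss(z1, z2, mlm_logits, mlm_labels, lam=0.1):
    """Contrastive objective plus a weighted auxiliary MLM objective.
    mlm_labels uses -100 at unmasked positions (ignored by cross_entropy)."""
    l_cl = infonce_loss(z1, z2)
    l_mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                            mlm_labels.view(-1), ignore_index=-100)
    return l_cl + lam * l_mlm
```

In this sketch the contrastive term keeps views of the same sentence close, while the auxiliary term forces the [CLS] embedding to carry enough information to reconstruct masked fragments, matching the abstract's stated motivation.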