Contrastive learning has attracted much attention for learning unsupervised sentence embeddings. The current state-of-the-art unsupervised method is unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE uses dropout as a minimal data augmentation method and passes the same input sentence to a pre-trained Transformer encoder (with dropout turned on) twice to obtain two corresponding embeddings that form a positive pair. Because the length of a sentence is generally encoded into its embedding through the position embeddings in the Transformer, the two sentences in each positive pair of unsup-SimCSE carry the same length information. A model trained with such positive pairs is therefore likely to be biased, tending to judge sentences of the same or similar length as more similar in semantics. Through statistical observations, we find that unsup-SimCSE does have this problem. To alleviate it, we apply a simple repetition operation to modify the input sentence, and then pass the input sentence and its modified counterpart to the pre-trained Transformer encoder, respectively, to get the positive pair. Additionally, drawing inspiration from the computer vision community, we introduce momentum contrast, which enlarges the number of negative pairs without additional calculations. The two proposed modifications apply to positive and negative pairs separately, and together they form a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets for the semantic textual similarity (STS) task. Experimental results show that ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base.
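To make the two modifications concrete, below is a minimal Python sketch of the word-repetition augmentation for positive pairs and a momentum-contrast queue for negative pairs. The names (`word_repetition`, `MomentumQueue`, `momentum_update`) and the specific values (`dup_rate=0.32`, `max_size=2560`, `m=0.995`) are illustrative assumptions, not the paper's reference implementation.

```python
import random
from collections import deque

def word_repetition(tokens, dup_rate=0.32, seed=None):
    """Randomly duplicate a small fraction of tokens (dup_rate is illustrative).

    The repeated sentence keeps the original semantics but gets a
    different length, so the positive pair no longer shares identical
    length information.
    """
    if not tokens:
        return list(tokens)
    rng = random.Random(seed)
    # Sample how many tokens to duplicate, capped by dup_rate.
    dup_len = rng.randint(0, max(1, int(dup_rate * len(tokens))))
    dup_idx = set(rng.sample(range(len(tokens)), dup_len))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_idx:
            out.append(tok)  # repeat this token once
    return out

class MomentumQueue:
    """FIFO queue of past sentence embeddings reused as extra negatives.

    The queued embeddings come from a momentum-updated copy of the
    encoder, so enlarging the negative set requires no extra gradient
    computation through the trainable encoder.
    """
    def __init__(self, max_size=2560):  # max_size is an assumed value
        self.queue = deque(maxlen=max_size)

    def enqueue(self, batch_embeddings):
        # Oldest embeddings fall out automatically once the queue is full.
        self.queue.extend(batch_embeddings)

    def negatives(self):
        return list(self.queue)

def momentum_update(momentum_params, online_params, m=0.995):
    """EMA update of the momentum encoder's parameters (m is illustrative)."""
    for name in momentum_params:
        momentum_params[name] = m * momentum_params[name] + (1 - m) * online_params[name]

# Example: a 6-token sentence becomes a longer positive counterpart.
print(word_repetition("the cat sat on the mat".split(), seed=0))
```

In a training loop under these assumptions, each batch's momentum-encoder embeddings would first serve as negatives drawn from the queue and then be enqueued for later batches, mirroring the MoCo-style mechanism the abstract alludes to.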