Previous studies of unsupervised sentence embedding have focused on data augmentation methods such as dropout masking and rule-based sentence transformation. However, these approaches offer limited control over the fine-grained semantics of a sentence's augmented views, which results in inadequate supervision signals for capturing the semantic similarity of similar sentences. In this work, we find that using neighbor sentences captures the semantic similarity between similar sentences more accurately. Based on this finding, we propose RankEncoder, which uses the relations between an input sentence and the sentences in a corpus to train unsupervised sentence encoders. We evaluate RankEncoder from three perspectives: 1) semantic textual similarity performance, 2) efficacy on similar sentence pairs, and 3) universality. Experimental results show that RankEncoder achieves 80.07% Spearman's correlation, a 1.1% absolute improvement over the previous state of the art. The improvement is larger still, 1.73%, on similar sentence pairs. We also demonstrate that RankEncoder is universally applicable to existing unsupervised sentence encoders.
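The abstract does not spell out how the relations between an input sentence and the corpus are computed. As a rough illustration only, the sketch below assumes that each sentence is described by the ranks of its cosine similarities to a set of corpus embeddings, and that two sentences are compared by correlating those rank vectors. The function names (`rank_vector`, `rank_similarity`, `spearman`) and the random embeddings are hypothetical and are not taken from the paper. The `spearman` helper shows the kind of rank correlation used to report semantic textual similarity results such as the 80.07% figure above.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def to_ranks(values):
    """Map values to ranks 0..n-1 (ties are broken arbitrarily, not averaged)."""
    return np.argsort(np.argsort(values)).astype(float)


def pearson(a, b):
    """Pearson correlation of two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_vector(query, corpus):
    """Assumed construction: ranks of corpus sentences by cosine similarity to the query."""
    sims = l2_normalize(corpus) @ l2_normalize(query)
    return to_ranks(sims)


def rank_similarity(u, v, corpus):
    """Compare two sentence embeddings through their rank vectors over the corpus."""
    return pearson(rank_vector(u, corpus), rank_vector(v, corpus))


def spearman(predicted, gold):
    """Spearman's correlation: Pearson correlation computed on ranks."""
    return pearson(to_ranks(predicted), to_ranks(gold))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(50, 8))      # stand-in corpus embeddings
    u = rng.normal(size=8)                 # stand-in sentence embedding
    v = u + 0.1 * rng.normal(size=8)       # a slightly perturbed "similar" sentence
    print("rank similarity:", rank_similarity(u, v, corpus))
    print("spearman:", spearman([0.1, 0.4, 0.35, 0.8], [1.0, 3.0, 2.5, 5.0]))
```

In this sketch the base embeddings could come from any existing encoder, which is one way to read the universality claim; whether the method is actually applied this way is an assumption here. For real evaluation data with tied scores, a tie-aware implementation such as `scipy.stats.spearmanr` would be the safer choice.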