Though they offer strong contextualized token-level representations, current pre-trained language models pay little attention to acquiring sentence-level representations during self-supervised pre-training. If self-supervised learning is divided into two subcategories, generative and contrastive, most existing studies show that sentence representation learning benefits more from contrastive methods than from generative ones. However, contrastive learning is not well compatible with the common token-level generative self-supervised objectives and does not guarantee good performance on downstream semantic retrieval tasks. To alleviate these issues, we instead propose a novel generative self-supervised learning objective based on phrase reconstruction. Empirical studies show that our generative objective yields sufficiently powerful sentence representations, achieving performance on Semantic Textual Similarity (STS) tasks on par with contrastive learning. Further, in the unsupervised setting, our generative method outperforms the previous state-of-the-art SimCSE on downstream semantic retrieval benchmarks.
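The abstract does not specify implementation details. As a rough illustration of the two objective families it contrasts, the sketch below shows a SimCSE-style in-batch contrastive (InfoNCE) loss next to a simple phrase-reconstruction loss conditioned on a pooled sentence vector. All names here (ToyPhraseDecoder, phrase_reconstruction_loss, the fixed-length phrase decoder) are hypothetical illustrations, not the paper's architecture or code.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """SimCSE-style InfoNCE: two views of the same sentence are positives,
    all other sentences in the batch serve as negatives."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = (z1 @ z2.t()) / temperature                  # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)   # diagonal entries are positives
    return F.cross_entropy(logits, labels)


class ToyPhraseDecoder(torch.nn.Module):
    """Toy stand-in decoder: maps a pooled sentence vector to logits for a fixed-length phrase."""

    def __init__(self, hidden: int, phrase_len: int, vocab_size: int):
        super().__init__()
        self.phrase_len, self.vocab_size = phrase_len, vocab_size
        self.proj = torch.nn.Linear(hidden, phrase_len * vocab_size)

    def forward(self, sentence_repr: torch.Tensor) -> torch.Tensor:
        return self.proj(sentence_repr).view(-1, self.phrase_len, self.vocab_size)


def phrase_reconstruction_loss(sentence_repr: torch.Tensor,
                               decoder: torch.nn.Module,
                               phrase_token_ids: torch.Tensor) -> torch.Tensor:
    """Generative alternative: predict the tokens of a held-out phrase
    conditioned only on the sentence representation."""
    logits = decoder(sentence_repr)                        # (batch, phrase_len, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           phrase_token_ids.reshape(-1))


# Toy usage with random tensors (batch=4, hidden=8, phrase_len=3, vocab=100):
z1, z2 = torch.randn(4, 8), torch.randn(4, 8)
decoder = ToyPhraseDecoder(hidden=8, phrase_len=3, vocab_size=100)
phrase_ids = torch.randint(0, 100, (4, 3))
print(contrastive_loss(z1, z2).item(),
      phrase_reconstruction_loss(torch.randn(4, 8), decoder, phrase_ids).item())
```

The key difference the abstract emphasizes is that the second objective is purely generative (a token-prediction cross-entropy, like standard masked-language-model pre-training) rather than relying on in-batch negatives, which is what makes it compatible with token-level generative pre-training.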