Though offering impressive contextualized token-level representations, current pre-trained language models pay little attention to accurately acquiring sentence-level representations during self-supervised pre-training. Meanwhile, the contrastive objectives that dominate current sentence representation learning offer little linguistic interpretability and no performance guarantee on downstream semantic tasks. We instead propose a novel generative self-supervised learning objective based on phrase reconstruction. To overcome the drawbacks of previous generative methods, we carefully model intra-sentence structure by breaking a sentence down into its important phrases. Empirical studies show that our generative objective yields substantial performance gains and outperforms current state-of-the-art contrastive methods not only on the STS benchmarks but also on downstream semantic retrieval and reranking tasks. Our code is available at https://github.com/chengzhipanpan/PaSeR.
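To make the objective concrete, the following is a minimal, hypothetical sketch of a phrase-reconstruction loss: a sentence vector must allow a small decoder to regenerate each important phrase, so the reconstruction error, rather than a contrastive term, supervises the sentence embedding. This is not the authors' implementation; the toy encoder, the GRU decoder, and all names (`PhraseReconstructionLoss`, `phrase_spans`, dimensions) are illustrative assumptions.

```python
# Hypothetical sketch of a generative phrase-reconstruction objective (not the PaSeR code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseReconstructionLoss(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)        # toy stand-in for a PLM encoder
        self.decoder = nn.GRU(dim, dim, batch_first=True)   # light autoregressive phrase decoder
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, token_ids: torch.Tensor, phrase_spans):
        # token_ids: (seq_len,) ids of one sentence; phrase_spans: list of (start, end) indices
        hidden = self.tok_emb(token_ids)                     # (seq_len, dim)
        sent_emb = hidden.mean(dim=0)                        # mean-pooled sentence vector
        loss = 0.0
        for start, end in phrase_spans:
            phrase = token_ids[start:end]                    # gold tokens of one important phrase
            # teacher forcing: first decoder input is the sentence vector,
            # then the phrase tokens shifted right by one position
            dec_in = torch.cat(
                [sent_emb.view(1, 1, -1), self.tok_emb(phrase[:-1]).unsqueeze(0)], dim=1
            )                                                # (1, phrase_len, dim)
            dec_out, _ = self.decoder(dec_in)                # (1, phrase_len, dim)
            logits = self.out(dec_out.squeeze(0))            # (phrase_len, vocab)
            # the sentence embedding is good if it lets the decoder reconstruct the phrase
            loss = loss + F.cross_entropy(logits, phrase)
        return loss / max(len(phrase_spans), 1)

# Toy usage: one 8-token sentence with two "important" phrase spans.
model = PhraseReconstructionLoss(vocab_size=100)
sentence = torch.randint(0, 100, (8,))
print(model(sentence, [(1, 4), (5, 8)]))
```

In this reading, gradients flow from the phrase decoder back into the sentence representation, so the encoder is pushed to preserve phrase-level content rather than merely to separate positive and negative pairs as contrastive objectives do.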