To obtain high-quality sentence embeddings from pretrained language models, they must either be augmented with additional pretraining objectives or finetuned on large amounts of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how large pretrained language models can be leveraged to obtain high-quality embeddings without requiring any labeled data, finetuning or modifications to their pretraining objective: We utilize their generative abilities to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models. Our fully unsupervised approach outperforms strong baselines on several English semantic textual similarity datasets.
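To make the described pipeline concrete, below is a minimal sketch (not the authors' released implementation) of the two stages: prompting a large generative language model to produce labeled sentence pairs from scratch, then finetuning a much smaller sentence-embedding model on the synthetic pairs. The model names, prompt template, and use of the Hugging Face transformers and sentence-transformers libraries are assumptions for illustration only.

```python
# Sketch only: a large generative PLM writes sentence pairs for a given
# similarity level, and the synthetic pairs are used to finetune a smaller
# sentence-embedding model. Model names and the prompt are placeholders.

from transformers import pipeline
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# 1) Generate labeled text pairs from scratch with a large generative PLM.
generator = pipeline("text-generation", model="gpt2-xl")  # placeholder model

prompt = ('Write two sentences that mean the same thing.\n'
          'Sentence 1: "A man is playing a guitar."\n'
          'Sentence 2: "')
outputs = generator(prompt, max_new_tokens=30, num_return_sequences=8,
                    do_sample=True, top_p=0.9)

pairs = []
for out in outputs:
    continuation = out["generated_text"][len(prompt):]
    sentence2 = continuation.split('"')[0].strip()  # keep text up to the closing quote
    if sentence2:
        # label 1.0 = "same meaning"; different prompts would yield lower labels
        pairs.append(InputExample(texts=["A man is playing a guitar.", sentence2],
                                  label=1.0))

# 2) Finetune a much smaller model on the synthetic dataset.
student = SentenceTransformer("distilroberta-base")  # placeholder small model
loader = DataLoader(pairs, shuffle=True, batch_size=8)
loss = losses.CosineSimilarityLoss(student)
student.fit(train_objectives=[(loader, loss)], epochs=1)
```

In this sketch the similarity label is fixed by the prompt rather than annotated by humans, which is what makes the overall approach fully unsupervised.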