To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires substantial human effort to generate suitable datasets of sufficient size. In this paper, we show how large PLMs can be leveraged to obtain high-quality embeddings without requiring any labeled data, finetuning, or modifications to the pretraining objective: We utilize the generative abilities of PLMs to generate entire datasets of labeled text pairs from scratch, which can then be used for regular finetuning of much smaller models. Our fully unsupervised approach outperforms strong baselines on several English semantic textual similarity datasets.
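The core idea above — prompting a generative PLM to produce labeled sentence pairs from scratch, then using them as ordinary finetuning data — can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the prompt templates, similarity labels, and the stub `generate` function are assumptions; a real implementation would replace `generate` with sampling from an actual PLM (e.g. a GPT-style model) and stop generation at the closing quote.

```python
# Hedged sketch: generating a labeled text-pair dataset with a generative PLM.
# The templates and labels are illustrative assumptions, not the paper's exact prompts.

PROMPTS = {
    # similarity label -> instruction given to the generative model
    1.0: 'Write two sentences that mean the same thing.\nSentence 1: "{s1}"\nSentence 2: "',
    0.0: 'Write two sentences on completely different topics.\nSentence 1: "{s1}"\nSentence 2: "',
}

def generate(prompt: str) -> str:
    """Placeholder for PLM sampling; a real system would return the model's
    continuation of the prompt, truncated at the closing quotation mark."""
    return "a model-generated continuation"

def make_pairs(seed_sentences, n_per_label=1):
    """Build (sentence1, sentence2, similarity) triples from scratch."""
    pairs = []
    for s1 in seed_sentences:
        for label, template in PROMPTS.items():
            for _ in range(n_per_label):
                s2 = generate(template.format(s1=s1))
                pairs.append((s1, s2, label))
    return pairs

dataset = make_pairs(["The cat sat on the mat."])
```

The resulting triples have the same shape as human-labeled semantic-similarity data, so a much smaller model can be finetuned on them with any standard sentence-embedding training loop.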