We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e., the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state of the art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of substitute candidates, reducing morphophonetic or morphosyntactic biases induced by article-noun agreement.
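To make the retrieval step concrete, the sketch below scores substitute candidates by the cosine similarity between the target word's contextualised embedding and each candidate's decontextualised embedding, i.e., its contextual representation averaged over multiple example sentences. This is a minimal sketch under stated assumptions: the `bert-base-uncased` model, the hand-picked candidate list, and the example contexts are all illustrative, since the abstract does not specify the underlying model or how candidates and contexts are obtained.

```python
# Minimal sketch of similarity-based substitute retrieval.
# Assumptions (not from the paper): bert-base-uncased, a small
# hand-picked candidate set, and toy example contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Contextualised embedding of `word`: mean of its subword vectors."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    # Locate the word's subword span inside the encoded sentence.
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i : i + len(word_ids)] == word_ids:
            return hidden[i : i + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in {sentence!r}")

def decontextualised(word: str, contexts: list[str]) -> torch.Tensor:
    """Decontextualised embedding: the word's contextual representation
    averaged over multiple example contexts."""
    vecs = [word_embedding(c, word) for c in contexts]
    return torch.stack(vecs).mean(dim=0)

# Target occurrence and hypothetical candidates with example contexts.
target_vec = word_embedding("The bright student solved the problem.", "bright")
candidates = {
    "clever": ["She is a clever negotiator.", "A clever trick saved the day."],
    "shiny": ["He wore a shiny watch.", "The car had a shiny finish."],
}
scores = {
    w: torch.cosine_similarity(target_vec, decontextualised(w, ctxs), dim=0).item()
    for w, ctxs in candidates.items()
}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # best substitute first
```

A full implementation would presumably draw the example contexts from a large corpus and rank over the whole vocabulary rather than a hand-picked list; averaging over many contexts is what gives each candidate a stable, context-independent representation to compare against.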