We present a word-sense induction method based on pre-trained masked language models (MLMs), which can cheaply scale to large vocabularies and large corpora. The result is a corpus that is sense-tagged according to a corpus-derived sense inventory, where each sense is associated with indicative words. Evaluation on English Wikipedia sense-tagged with our method shows that both the induced senses and the per-instance sense assignments are of high quality, even when compared to WSD methods such as Babelfy. Furthermore, by training a static word embedding algorithm on the sense-tagged corpus, we obtain high-quality static senseful embeddings. These outperform existing senseful embedding techniques on the WiC dataset and on a new outlier detection dataset we developed. The data-driven nature of the algorithm makes it possible to induce corpus-specific senses that may not appear in standard sense inventories, as we demonstrate with a case study on the scientific domain.
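The pipeline described above can be illustrated with a minimal sketch of the induction step. This is a hypothetical toy implementation, not the paper's actual algorithm: the hand-made substitute lists below stand in for the top MLM predictions (e.g. from a fill-mask model) for each occurrence of a target word, and a simple greedy Jaccard-overlap clustering stands in for whatever clustering the method actually uses. Each resulting cluster is an induced sense, labelled by its most frequent substitutes (the "indicative words").

```python
# Toy sense-induction sketch (hypothetical; assumes substitutes were
# already obtained from a masked LM for each occurrence of a word).
from collections import Counter


def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def induce_senses(substitutes, threshold=0.2):
    """Greedily cluster occurrences by overlap of their substitute lists."""
    clusters = []  # each cluster is a list of occurrence indices
    for i, subs in enumerate(substitutes):
        best, best_sim = None, threshold
        for c in clusters:
            # compare against the pooled substitutes of the cluster
            pool = [w for j in c for w in substitutes[j]]
            sim = jaccard(subs, pool)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])  # start a new induced sense
        else:
            best.append(i)
    return clusters


def indicative_words(cluster, substitutes, k=3):
    """Most frequent substitutes in a cluster serve as its sense label."""
    counts = Counter(w for i in cluster for w in substitutes[i])
    return [w for w, _ in counts.most_common(k)]


# Six occurrences of "bass": three music readings, three fish readings.
subs = [
    ["guitar", "drums", "cello", "keyboard"],
    ["guitar", "cello", "violin", "piano"],
    ["drums", "guitar", "piano", "vocals"],
    ["trout", "salmon", "fish", "perch"],
    ["salmon", "trout", "pike", "fish"],
    ["fish", "perch", "trout", "carp"],
]

senses = induce_senses(subs)
print(len(senses))  # → 2 induced senses
print(indicative_words(senses[0], subs))
```

Tagging each occurrence in the corpus with its cluster ID (e.g. `bass_0`, `bass_1`) and then running an off-the-shelf static embedding trainer over the tagged text yields one vector per induced sense, which is the idea behind the senseful embeddings mentioned above.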