Language use varies over time as well as across social groups and knowledge domains, leading to differences even within a single language. Such variation in word usage is often called lexical semantic change (LSC). The goal of LSC detection is to characterize and quantify language variation with respect to word meaning, and thereby to measure how distinct two language sources (that is, people or language models) are. Because hardly any data is available for this task, most solutions use unsupervised methods that align two embeddings and predict semantic change using a distance measure. To address this lack of data, we propose a self-supervised approach to modeling lexical semantic change that generates training samples by introducing perturbations of word vectors in the input corpora. We show that our method can be used to detect semantic change with any alignment method. Furthermore, it can be used to choose the landmark words for alignment, which can lead to substantial improvements over existing alignment techniques. We illustrate the utility of our techniques with experimental results on three different datasets, involving words with the same or different meanings. Our methods not only provide significant improvements but can also lead to novel findings for the LSC problem.
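The align-then-measure pipeline mentioned above can be illustrated with a minimal sketch. It assumes orthogonal Procrustes as the alignment method and cosine distance as the change score; both are common choices but are assumptions here, not necessarily what the authors use. The function names, the toy embeddings, and the way one word is perturbed are all hypothetical and only simulate a changed word so the score has something to find.

```python
import numpy as np

def procrustes_align(A, B, landmarks):
    """Rotate embedding A onto B with an orthogonal map fitted on landmark rows only."""
    M = A[landmarks].T @ B[landmarks]
    U, _, Vt = np.linalg.svd(M)
    Q = U @ Vt                      # orthogonal matrix minimizing ||A_L Q - B_L||_F
    return A @ Q

def cosine_distance(X, Y):
    """Row-wise cosine distance between two matrices of equal shape."""
    num = np.sum(X * Y, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return 1.0 - num / den

# Toy setup (hypothetical data): 50 words, 8 dimensions.
rng = np.random.default_rng(0)
n_words, dim = 50, 8
B = rng.normal(size=(n_words, dim))

# A is B under an unknown rotation, as if trained separately on a second corpus.
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
A = B @ R.T

# Simulate semantic change by perturbing one word vector (illustrative assumption).
changed = 3
A[changed] = rng.normal(size=dim)

# Use every other word as a landmark, align, then score each word by distance.
landmarks = np.array([i for i in range(n_words) if i != changed])
A_aligned = procrustes_align(A, B, landmarks)
dist = cosine_distance(A_aligned, B)
most_changed = int(np.argmax(dist))
```

In this toy case the landmark words end up with near-zero distance after alignment, so the perturbed word stands out. Including a changed word among the landmarks would distort the fitted rotation, which is why the choice of landmarks matters.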