Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks

The use of language varies over time as well as across social groups and knowledge domains, leading to differences even in the monolingual scenario. Such variation in word usage is often called lexical semantic change (LSC). The goal of LSC detection is to characterize and quantify language variation with respect to word meaning, and thereby to measure how distinct two language sources (that is, people or language models) are. Because hardly any data is available for this task, most solutions rely on unsupervised methods that align two embeddings and predict semantic change with respect to a distance measure. To that end, we propose a self-supervised approach to modeling lexical semantic change that generates training samples by introducing perturbations of word vectors in the input corpora. We show that our method can be used to detect semantic change with any alignment method. Furthermore, it can be used to choose the landmark words for alignment and can lead to substantial improvements over existing alignment techniques. We illustrate the utility of our techniques with experimental results on three different datasets, involving words with the same or different meanings. Our methods not only provide significant improvements but can also lead to novel findings for the LSC problem.
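To make the "align two embeddings, then measure distance" pipeline concrete, here is a minimal sketch using orthogonal Procrustes alignment and cosine distance on synthetic data. It is an illustration under stated assumptions, not the authors' implementation: the random matrices `A` and `B`, the function names, and the choice of Procrustes as the alignment method are all hypothetical stand-ins, and the single replaced vector only mimics a word whose usage has changed.

```python
import numpy as np

def procrustes_align(A, B):
    """Return the orthogonal matrix Q that minimizes ||A @ Q - B||_F."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def cosine_distance(X, Y):
    """Row-wise cosine distance between two matrices of equal shape."""
    num = np.sum(X * Y, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    return 1.0 - num / den

# Synthetic stand-ins for two embedding spaces (assumption: a shared
# vocabulary, with row i holding the vector of word i in both spaces).
rng = np.random.default_rng(0)
n_words, dim = 200, 20
A = rng.normal(size=(n_words, dim))

# Space B is a rotated copy of A, so most words are "stable".
R, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
B = A @ R

# Simulate one semantically shifted word by replacing its vector in B.
shifted = 7
B[shifted] = rng.normal(size=dim)

# Align A onto B, then score every word by its post-alignment distance.
Q = procrustes_align(A, B)
dist = cosine_distance(A @ Q, B)
most_changed = int(np.argmax(dist))
```

In this toy setup the replaced word should receive the largest distance, while the unperturbed words stay close to zero. Here every word serves as an alignment landmark; restricting the rows passed to `procrustes_align` would be one way to experiment with landmark selection.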


Word Vector Representation

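A distributed representation encodes language as dense, low-dimensional, continuous vectors. Researchers observed early on that learned word embeddings exhibit analogy relations, for example apple − apples ≈ car − cars and man − woman ≈ king − queen. These methods can be trained directly on large unlabeled corpora. The quality of word embeddings also depends heavily on the choice of context window size: larger context windows generally yield embeddings that reflect topical information, while smaller context windows yield embeddings that better reflect a word's function and contextual semantics.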
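The analogy relation man − woman ≈ king − queen can be checked with simple vector arithmetic. The sketch below uses hand-made three-dimensional toy vectors (an assumption for illustration; real embeddings are learned from corpora and have far more dimensions), and the helper `analogy` is a hypothetical name, not a library function.

```python
import numpy as np

# Toy vectors with hand-picked axes: [royalty, gender, fruit].
vectors = {
    "king":  np.array([1.0,  1.0, 0.0]),
    "queen": np.array([1.0, -1.0, 0.0]),
    "man":   np.array([0.0,  1.0, 0.0]),
    "woman": np.array([0.0, -1.0, 0.0]),
    "apple": np.array([0.0,  0.0, 1.0]),
}

def analogy(vectors, positive, negative):
    """Return the word closest (by cosine similarity) to
    sum(positive) - sum(negative), excluding the query words."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in positive or word in negative:
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# king - man + woman should land on queen in this toy space.
result = analogy(vectors, positive=["king", "woman"], negative=["man"])
```

Excluding the query words from the candidates follows common practice for analogy lookups, since the nearest vector to the target is otherwise often one of the inputs.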
