Pre-trained contextual language models are ubiquitously employed for language understanding tasks, but are unsuitable for resource-constrained systems. Noncontextual word embeddings are an efficient alternative in these settings. Such methods typically encode multiple distinct meanings of a word with a single vector and therefore incur errors due to polysemy. This paper proposes a two-stage method to distill multiple word senses from a pre-trained language model (BERT): attention over the senses of a word in context is used to disambiguate it, and this sense information is transferred to fit multi-sense embeddings in a skip-gram-like framework. We demonstrate an effective approach to training the sense-disambiguation mechanism in our model with a distribution over word senses extracted from the output-layer embeddings of BERT. Experiments on contextual word similarity and sense-induction tasks show that this method is superior to or competitive with state-of-the-art multi-sense embeddings on multiple benchmark data sets, and experiments with an embedding-based topic model (ETM) demonstrate the benefits of using these multi-sense embeddings in a downstream application.
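The attention-over-senses step summarized above can be illustrated with a minimal sketch. The dot-product scoring, dimensions, and function names below are illustrative assumptions, not the paper's exact formulation (which trains the disambiguation mechanism against a sense distribution distilled from BERT's output-layer embeddings).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def contextual_sense_embedding(sense_vectors, context_vector):
    """Attend over a word's sense vectors given a context representation.

    sense_vectors: (K, d) array, one row per candidate sense of the word.
    context_vector: (d,) array summarizing the surrounding context.
    Returns the attention weights over senses and the resulting
    context-dependent word embedding (a convex combination of the senses).
    """
    scores = sense_vectors @ context_vector   # (K,) dot-product scores (assumed scoring function)
    weights = softmax(scores)                 # distribution over senses
    return weights, weights @ sense_vectors   # (d,) blended embedding

# Toy usage: a word with 3 hypothetical senses in a 4-dimensional space.
rng = np.random.default_rng(0)
senses = rng.normal(size=(3, 4))
context = rng.normal(size=4)
w, emb = contextual_sense_embedding(senses, context)
print("sense weights:", w)
print("contextual embedding:", emb)
```

In the skip-gram-like second stage, a blended embedding of this kind would stand in for the single word vector when predicting context words, so that gradients update only the senses the attention selects.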