We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method that outperforms previously described contextualized approaches. We then use this method as the basis for an in-depth analysis of the degrees of semantic change predicted for English words across five decades. Our findings show that contextualized methods often predict high change scores for words that are not undergoing any real diachronic semantic shift in the lexicographic sense of the term (or for which the status of the shift is at least questionable). We discuss such challenging cases in detail with examples and propose a linguistic categorization for them. We conclude that pre-trained contextualized language models are prone to confounding changes in lexicographic senses with changes in contextual variance. This tendency stems naturally from their distributional nature, but it differs from the types of issues observed in methods based on static embeddings. Additionally, these models often merge syntactic and semantic aspects of lexical entities. We propose a range of possible future solutions to these issues.