Detecting lexical semantic change in smaller data sets, e.g., in historical linguistics and digital humanities, is challenging due to a lack of statistical power. This issue is exacerbated by non-contextual embedding models, which produce a single embedding per word and therefore mask the variability present in the data. In this article, we propose an approach to estimating semantic shift that combines contextual word embeddings with permutation-based statistical tests. We use the false discovery rate procedure to address the large number of hypothesis tests being conducted simultaneously. We demonstrate the performance of this approach in simulations, where it achieves consistently high precision by suppressing false positives. We additionally analyze real-world data from SemEval-2020 Task 1 and the Liverpool FC subreddit corpus. We show that by taking sample variation into account, we can improve the robustness of individual semantic shift estimates without degrading overall performance.
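To make the abstract's pipeline concrete, the following is a minimal sketch of the two statistical ingredients it names: a permutation test over per-occurrence contextual embeddings of a single word, and Benjamini-Hochberg false discovery rate control across many such tests. The function names (permutation_test_shift, benjamini_hochberg), the choice of Euclidean distance between mean embeddings as the test statistic, and the numpy-array representation of embeddings are illustrative assumptions, not details taken from the paper.

import numpy as np

def permutation_test_shift(emb_t1, emb_t2, n_permutations=1000, rng=None):
    """Permutation test for semantic shift of one word.

    emb_t1, emb_t2: arrays of shape (n_i, d) with contextual embeddings
    of the word's occurrences in time periods 1 and 2 (shapes assumed
    for illustration). The test statistic here is the Euclidean distance
    between the two mean embeddings; the null distribution is built by
    repeatedly shuffling the period labels.
    """
    rng = np.random.default_rng(rng)
    observed = np.linalg.norm(emb_t1.mean(axis=0) - emb_t2.mean(axis=0))
    pooled = np.vstack([emb_t1, emb_t2])
    n1 = len(emb_t1)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        stat = np.linalg.norm(pooled[perm[:n1]].mean(axis=0)
                              - pooled[perm[n1:]].mean(axis=0))
        if stat >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_permutations + 1)

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure: return a boolean mask of rejected
    null hypotheses at false discovery rate alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank meeting its threshold
        rejected[order[:k + 1]] = True
    return rejected

In use, one p-value would be computed per target word and the resulting vector passed to benjamini_hochberg; words whose nulls are rejected are the ones flagged as having undergone semantic shift at the chosen false discovery rate.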