Type- and token-based embedding architectures are still competing in lexical semantic change detection. The recent success of type-based models in SemEval-2020 Task 1 has raised the question of why the success of token-based models on a variety of other NLP tasks does not carry over to our field. We investigate the influence of a range of variables on clusterings of BERT vectors and show that BERT's low performance is largely due to orthographic information about the target word, which is encoded even in the higher layers of its representations. By reducing the influence of orthography, we considerably improve BERT's performance.
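To make the described setup concrete, below is a minimal sketch of one way to cluster contextualized BERT vectors of a target word from two time periods and to suppress orthographic cues by replacing the surface form with its lemma before encoding. This is an illustration under stated assumptions, not the authors' exact pipeline: the model name, layer choice, number of clusters, the lemma-replacement step, and the Jensen-Shannon change score are all illustrative.

```python
# Sketch: usage clustering with BERT for semantic change detection.
# Assumptions (not from the paper): bert-base-uncased, last layer,
# k-means with k=4, Jensen-Shannon distance as the change score.
import numpy as np
import torch
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def target_vector(sentence: str, target: str, layer: int = 12) -> np.ndarray:
    """Mean-pool the subword vectors of `target` from one hidden layer."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]        # model does not accept offsets
    start = sentence.index(target)                # first occurrence, for simplicity
    end = start + len(target)
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]
    # Keep subword positions whose character span overlaps the target.
    idx = [i for i, (s, e) in enumerate(offsets.tolist())
           if s < end and e > start and e > s]
    return hidden[idx].mean(dim=0).numpy()

def change_score(uses_t1, uses_t2, target, lemma=None, k=4):
    """Cluster pooled usages from both periods; compare cluster distributions."""
    def encode(uses):
        vecs = []
        for s in uses:
            if lemma is not None:                 # replace surface form with lemma
                s = s.replace(target, lemma)      # to reduce orthographic cues
            vecs.append(target_vector(s, lemma or target))
        return np.stack(vecs)
    v1, v2 = encode(uses_t1), encode(uses_t2)
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.vstack([v1, v2]))
    p = np.bincount(labels[: len(v1)], minlength=k) / len(v1)
    q = np.bincount(labels[len(v1):], minlength=k) / len(v2)
    return jensenshannon(p, q)                    # higher = more semantic change
```

The lemma replacement (e.g., encoding "walk" in place of "walked") illustrates the abstract's point: if clusters are driven by the target's orthographic form rather than its context, normalizing that form should change the clustering, which is one way the influence of orthography can be reduced.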