Historical linguists have identified regularities in the process of historical sound change. The comparative method uses those regularities to reconstruct proto-words from observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of the neural models to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals that the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics.