Sound correspondence patterns form the basis of cognate detection and phonological reconstruction in historical language comparison. Methods for the automatic inference of correspondence patterns from phonetically aligned cognate sets have been proposed, but their application to multilingual wordlists requires extremely well-annotated datasets. Since annotation is tedious and time-consuming, it would be desirable to find ways to improve aligned cognate data automatically. Taking inspiration from trimming techniques in evolutionary biology, which improve alignments by excluding problematic sites, we propose a workflow that trims phonetic alignments in comparative linguistics prior to the inference of correspondence patterns. Testing these techniques on a large standardized collection of ten datasets with expert annotations from different language families, we find that the best trimming technique substantially improves the overall consistency of the alignments. The results show a clear increase both in the proportion of frequent correspondence patterns and in the proportion of words exhibiting regular cognate relations.
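To make the idea of trimming concrete, the following is a minimal sketch of one possible strategy: dropping alignment columns (sites) in which the share of gaps exceeds a threshold. The abstract does not specify which trimming strategies were evaluated, so the gap-proportion criterion, the function names `gap_proportions` and `trim_alignment`, the `threshold` parameter, and the toy alignment are all assumptions made for illustration.

```python
from typing import List

GAP = "-"  # assumed gap symbol in the aligned sequences


def gap_proportions(alignment: List[List[str]]) -> List[float]:
    """Return, for each column, the share of rows that have a gap there."""
    n_rows = len(alignment)
    width = len(alignment[0])
    return [
        sum(1 for row in alignment if row[i] == GAP) / n_rows
        for i in range(width)
    ]


def trim_alignment(
    alignment: List[List[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Drop every column whose gap proportion exceeds `threshold`.

    Hypothetical helper: a simple gap-based criterion, not necessarily
    the strategy used in the paper.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    props = gap_proportions(alignment)
    keep = [i for i, p in enumerate(props) if p <= threshold]
    return [[row[i] for i in keep] for row in alignment]


# Toy alignment (invented data): four aligned words, five sites.
toy = [
    ["h", "a", "n", "d", "-"],
    ["h", "a", "n", "t", "-"],
    ["h", "a", "n", "d", "-"],
    ["h", "ɔ", "n", "d", "ə"],
]

# The last site is a gap in 3 of 4 rows (0.75 > 0.5), so it is removed.
trimmed = trim_alignment(toy, threshold=0.5)
```

In this sketch, sites that are mostly gaps are treated as the "problematic" ones, on the assumption that they contribute little evidence for regular correspondences; other criteria would fit the same workflow.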