This paper provides language identification models for low- and under-resourced languages in the Pacific region, with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings performs significantly better than the alternative methods. We then systematically increase the number of non-Austronesian languages in the model, up to a total of 800 languages, to evaluate whether a larger language inventory leads to less precise predictions for the Austronesian languages of interest. We find that increasing the inventory of non-Austronesian languages has only a minimal impact on accuracy. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.