The Perso-Arabic scripts are a family of scripts widely adopted by linguistic communities around the globe. Identifying the languages written in these scripts is crucial to language technologies and challenging in low-resource setups. This paper therefore sheds light on the challenges of detecting languages that use Perso-Arabic scripts, especially in bilingual communities where ``unconventional'' writing is practiced. To address these challenges, we use a set of supervised techniques to classify sentences by language. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experimental results indicate the effectiveness of our solutions.