One of the major challenges that under-represented and endangered language communities face in language technology is the scarcity of language data. This is also the case for Southern Kurdish and Laki, for which very few resources are available and little progress has been made on tools. To address this, we present several data collection approaches that rely on the content of local news websites, a local radio station that broadcasts in Southern Kurdish, and fieldwork for Laki. In this paper, we describe some of the challenges these under-represented languages pose, particularly in writing and standardization, as well as in retrieving data sources and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and the Zaza-Gorani languages.