Tagged corpora play a crucial role in a wide range of Natural Language Processing (NLP) tasks. Part-of-Speech Tagging (POST) is essential in developing tagged corpora. Manual tagging is time-consuming, labor-intensive, and costly, so automating it could make it more affordable. The Kurdish language currently lacks publicly available tagged corpora of adequate size. Tagging the publicly available Kurdish corpora would make those resources more useful than raw or segmented corpora alone. Developing POS-tagged lexicons can assist this task. We use a tagged corpus in Persian (Farsi), the Bijankhan corpus, to develop a POS-tagged lexicon for Kurdish, since Persian is closely related to Kurdish. This paper presents this approach of leveraging the resources of a closely related language to enrich Kurdish resources. A partial dataset of the results is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/. We plan to make the whole tagged corpus available after further investigation of the outcome. The dataset can help in developing POS-tagged lexicons for other Kurdish dialects and in automating the tagging of Kurdish corpora.