Part of Speech Tagging (POST) of a Low-resource Language using another Language (Developing a POS-Tagged Lexicon for Kurdish (Sorani) using a Tagged Persian (Farsi) Corpus)

January 30, 2022

Hossein Hassani

From arXiv; 7 pages, 2 tables, 3 figures

Tagged corpora play a crucial role in a wide range of Natural Language Processing (NLP) tasks, and Part of Speech Tagging (POST) is essential to developing them. Manual tagging is time-consuming, labor-intensive, and costly, so automating it would make it more affordable. The Kurdish language currently lacks publicly available tagged corpora of adequate size. Tagging the publicly available Kurdish corpora would make those resources more useful than raw or segmented corpora alone. Developing POS-tagged lexicons can assist with this task. We use a tagged corpus in Persian (Farsi), the Bijankhan corpus, to develop a POS-tagged lexicon for Kurdish, since Persian is a language close to Kurdish. This paper presents the approach of leveraging the resources of a close language to enrich Kurdish resources. A partial dataset of the results is publicly available for non-commercial use under the CC BY-NC-SA 4.0 license at https://kurdishblark.github.io/. We plan to make the whole tagged corpus available after further investigation of the outcome. The dataset can help in developing POS-tagged lexicons for other Kurdish dialects and in automated tagging of Kurdish corpora.
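The abstract does not describe how Persian entries are aligned with Kurdish ones, so the following is only a minimal sketch of the general idea of carrying tags from a tagged corpus in one language over to a lexicon in a close language. It assumes a word-level Persian-to-Kurdish mapping is already available. The function names, the simplified tags (`N`, `ADJ`), and the romanized toy entries are all hypothetical and are not taken from the paper or from the Bijankhan corpus.

```python
from collections import Counter, defaultdict


def build_tag_lexicon(tagged_tokens):
    """Map each word to its most frequent tag in a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def project_lexicon(source_lexicon, word_map):
    """Carry tags over to target-language words via a source -> target word map.

    Source words that are missing from the source lexicon are skipped.
    """
    return {
        target: source_lexicon[source]
        for source, target in word_map.items()
        if source in source_lexicon
    }


# Toy data (assumption): romanized Persian tokens with simplified tags.
persian_tagged = [("ketab", "N"), ("khub", "ADJ"), ("ketab", "N"), ("ab", "N")]

# Toy data (assumption): a hand-written Persian -> Kurdish (Sorani) word map.
persian_to_kurdish = {"ketab": "kitêb", "ab": "aw", "khub": "baş"}

kurdish_lexicon = project_lexicon(
    build_tag_lexicon(persian_tagged), persian_to_kurdish
)
print(kurdish_lexicon)
```

Keeping only the most frequent tag per word is a simplification chosen for brevity. A lexicon meant for real use would probably keep every observed tag, since many words are ambiguous between parts of speech.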

Part-of-speech tagging

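Part of speech is a basic grammatical property of a word, and is also commonly called word class. Part-of-speech tagging is the process of determining the grammatical category of each word in a given sentence and labeling the word with that part of speech; it is an important foundational problem in Chinese information processing. In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also called grammatical tagging, is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. [1]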
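As an illustration of the definition above, here is a minimal sketch of the simplest kind of tagger: a lookup in a POS-tagged lexicon, with a fallback tag for unknown words. It uses a word's listed category only and ignores context, so it is a baseline and not a full tagger. The lexicon entries, the tag names, and the `UNK` fallback are assumptions made for this example.

```python
def tag_sentence(tokens, lexicon, unknown_tag="UNK"):
    """Label each token with its lexicon tag, or unknown_tag if it is not listed."""
    return [(token, lexicon.get(token.lower(), unknown_tag)) for token in tokens]


# Toy lexicon (assumption): lowercase word forms mapped to a single tag each.
toy_lexicon = {"the": "DET", "dog": "NOUN", "barks": "VERB"}

tagged = tag_sentence(["The", "dog", "barks", "loudly"], toy_lexicon)
print(tagged)
```

Words that can belong to more than one category are where such a lookup breaks down, which is why practical taggers also use the surrounding context.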
