A morphological analyzer divides an input word into its constituent morphemes and identifies their morphological roles; it is a significant component of many natural language processing applications, especially for morphologically rich languages. In this paper, we introduce a comprehensive morphological analyzer for Central Kurdish (CK), a low-resource language with rich morphology. Building upon the limited existing literature, we first assembled and systematically categorized a comprehensive collection of the morphological and morphophonological rules of the language. Additionally, we collected and manually labeled a generative lexicon containing nearly 10,000 verb, noun, and adjective stems, named entities, and other types of word stems. We used these rule sets and resources to implement the CKMorph Analyzer based on finite-state transducers. To provide a benchmark for future research, we collected, manually labeled, and publicly shared test sets for evaluating the accuracy and coverage of the analyzer. CKMorph correctly analyzed 95.9% of the accuracy test set, which contains 1,000 CK words morphologically analyzed according to their context. Moreover, CKMorph gave at least one analysis for 95.5% of the 4.22M CK tokens in the coverage test set. A demonstration of the application and the resources, including the CK verb database and test sets, are openly accessible at https://github.com/CKMorph.
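The two evaluation figures reported above can be read as simple ratios. The following is a minimal sketch of how such metrics might be computed, not the authors' evaluation code. The `analyze` callable, the function names, and the criterion that a word counts as correct when its gold analysis appears among the returned analyses are all assumptions made for illustration. The toy lexicon uses placeholder strings, not real CK data.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: an analyzer maps a word to zero or more analyses.
Analyzer = Callable[[str], List[str]]


def coverage(analyze: Analyzer, tokens: Iterable[str]) -> float:
    """Fraction of tokens that receive at least one analysis."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    covered = sum(1 for token in tokens if analyze(token))
    return covered / len(tokens)


def accuracy(analyze: Analyzer, gold: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (word, gold analysis) pairs whose gold analysis is returned.

    Assumption: a word is counted as correct when the context-appropriate
    gold analysis is among the analyzer's outputs.
    """
    gold = list(gold)
    if not gold:
        return 0.0
    correct = sum(1 for word, expected in gold if expected in analyze(word))
    return correct / len(gold)


# Placeholder lexicon standing in for a real analyzer (not CK data).
TOY_LEXICON = {
    "aaa": ["aaa+N"],
    "bbb": ["bbb+V", "bbb+N"],
}


def toy_analyze(word: str) -> List[str]:
    """Return all stored analyses for a word, or an empty list if unknown."""
    return TOY_LEXICON.get(word, [])


if __name__ == "__main__":
    print(coverage(toy_analyze, ["aaa", "bbb", "ccc", "aaa"]))
    print(accuracy(toy_analyze, [("aaa", "aaa+N"), ("bbb", "bbb+N")]))
```

Under this reading, coverage needs only raw tokens, which is why it can be measured on millions of tokens, while accuracy requires manually labeled gold analyses and is therefore measured on a much smaller set.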