This research is devoted to the low-resource Veps and Karelian languages. The article presents algorithms for assigning part-of-speech tags and grammatical properties to words. These algorithms use our morphological dictionaries, in which the lemma, part of speech, and set of grammatical features (gramset) are known for each word form. The algorithms are based on the analogy hypothesis that words with the same suffixes are likely to have the same inflectional models, the same part of speech, and the same gramset. The accuracy of these algorithms was evaluated and compared using 313 thousand Vepsian and 66 thousand Karelian words. Special functions were designed to assess the quality of the results. The developed algorithm assigned a correct part of speech to 92.4% of Vepsian words and 86.8% of Karelian words, and a correct gramset to 95.3% of Vepsian words and 90.7% of Karelian words. The paper also describes morphological and semantic tagging of texts, which are closely related and inseparable in our corpus processes.
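The analogy hypothesis can be illustrated with a minimal sketch. It is only one possible reading of the idea, not the authors' implementation: it assumes that an unknown word takes the most frequent part of speech among dictionary word forms sharing its longest known suffix. The function names `build_suffix_index` and `guess_pos`, the tag labels, and the toy word forms are all hypothetical placeholders and are not taken from the Veps or Karelian dictionaries.

```python
from collections import Counter, defaultdict


def build_suffix_index(dictionary):
    """Map every suffix of every known word form to a Counter of POS tags.

    `dictionary` is an iterable of (word_form, pos) pairs.
    """
    index = defaultdict(Counter)
    for word_form, pos in dictionary:
        for start in range(len(word_form)):
            index[word_form[start:]][pos] += 1
    return index


def guess_pos(word, index):
    """Guess the POS of `word` from the longest suffix found in `index`.

    Returns the most frequent POS recorded for that suffix,
    or None if no suffix of the word is known.
    """
    for start in range(len(word)):  # longest suffix first
        suffix = word[start:]
        if suffix in index:
            return index[suffix].most_common(1)[0][0]
    return None


# Toy data: invented strings for illustration only (assumption).
toy_dictionary = [
    ("talas", "NOUN"),
    ("metsas", "NOUN"),
    ("lugeb", "VERB"),
    ("tuleb", "VERB"),
]
index = build_suffix_index(toy_dictionary)

print(guess_pos("kalas", index))  # shares the suffix "alas" with "talas"
print(guess_pos("paneb", index))  # shares the suffix "eb" with both verbs
print(guess_pos("xyz", index))    # no known suffix
```

The same lookup could return a gramset instead of a part of speech by storing gramsets in the counters. Preferring the longest matching suffix is a design choice made for this sketch; the abstract does not specify how competing suffixes are weighed.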