Kyrgyz remains a low-resource language with limited foundational NLP tools. To address this gap, we introduce KyrgyzBERT, the first publicly available monolingual BERT-based language model for Kyrgyz. The model has 35.9M parameters and uses a custom tokenizer designed for the language's morphological structure. To evaluate performance, we create kyrgyz-sst2, a sentiment analysis benchmark built by translating the Stanford Sentiment Treebank and manually annotating the full test set. Fine-tuned on this dataset, KyrgyzBERT achieves an F1-score of 0.8280, competitive with a fine-tuned mBERT model five times its size. All models, data, and code are released to support future research in Kyrgyz NLP.
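Since the models are publicly released, a fine-tuned checkpoint can presumably be loaded for inference with the Hugging Face transformers library. The sketch below illustrates this under stated assumptions: the hub ID `example-org/kyrgyzbert-sst2`, the binary label mapping, and the sample sentence are hypothetical placeholders, not the authors' actual release names.

```python
# Minimal inference sketch for a KyrgyzBERT sentiment checkpoint.
# Assumption: a sequence-classification checkpoint exists on the Hugging Face
# Hub; the ID below is a hypothetical placeholder, not the authors' release.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "example-org/kyrgyzbert-sst2"  # hypothetical hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Example Kyrgyz input ("This movie was very good!").
text = "Бул фильм абдан жакшы болду!"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: label 1 = positive, label 0 = negative (SST-2 convention).
pred = logits.argmax(dim=-1).item()
print("positive" if pred == 1 else "negative")
```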