Semantic relatedness between words is a core concept in natural language processing, which makes the evaluation of semantic models an important task. In this paper, we present SimRelUz, an evaluation dataset for semantic models: a collection of similarity and relatedness scores for word pairs in Uzbek, a low-resource language. The dataset consists of more than a thousand word pairs, carefully selected on the basis of their morphological features, occurrence frequency, and semantic relations, and annotated by eleven native Uzbek speakers of different ages and genders. We also paid particular attention to rare and out-of-vocabulary words, so that the dataset can be used to thoroughly evaluate the robustness of semantic models.
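To illustrate how a dataset of scored word pairs of this kind is typically used, the sketch below ranks word pairs by a model's cosine similarity and compares that ranking with human scores via Spearman correlation. The abstract does not name a metric; Spearman correlation is assumed here because it is the usual choice for similarity benchmarks. The function names, the toy vectors, the toy scores, and the policy of skipping out-of-vocabulary pairs are all hypothetical and are not taken from SimRelUz.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation (0.0 if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))

def evaluate(pairs, vectors):
    """pairs: (word1, word2, human_score) triples; vectors: word -> list of floats.

    Assumption: pairs with an out-of-vocabulary word are skipped and counted.
    """
    human, model, skipped = [], [], 0
    for w1, w2, score in pairs:
        if w1 not in vectors or w2 not in vectors:
            skipped += 1
            continue
        human.append(score)
        model.append(cosine(vectors[w1], vectors[w2]))
    rho = spearman(human, model) if len(human) > 1 else float("nan")
    return rho, skipped

# Toy data, invented for illustration only (not from SimRelUz).
toy_vectors = {"kitob": [1.0, 0.0], "daftar": [0.9, 0.1], "olma": [0.0, 1.0]}
toy_pairs = [
    ("kitob", "daftar", 8.0),
    ("kitob", "olma", 1.0),
    ("daftar", "olma", 2.0),
    ("kitob", "nok", 3.0),  # "nok" has no vector, so this pair is skipped
]
rho, skipped = evaluate(toy_pairs, toy_vectors)
print(f"Spearman rho = {rho:.3f}, skipped OOV pairs = {skipped}")
```

Skipping out-of-vocabulary pairs can flatter a model with a small vocabulary, so reporting the skipped count alongside the correlation, or assigning such pairs a default similarity, may give a fairer picture of robustness.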