Word Sense Disambiguation (WSD) is a long-standing task in Natural Language Processing (NLP) that aims to automatically identify the most relevant meaning of a word in a given context. Standard WSD test collections are an important prerequisite for developing and evaluating WSD systems in a given language. Although many WSD test collections have been developed for a variety of languages, no standard all-words WSD benchmark is available for Persian. In this paper, we address this gap by introducing SBU-WSD-Corpus, the first standard test set for the Persian all-words WSD task. SBU-WSD-Corpus is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory. Three annotators performed the annotation using SAMP, a sense annotation tool based on the FarsNet lexical graph. The corpus consists of 19 Persian documents from different domains, such as sports, science, and arts. It includes 5892 content words of Persian running text and 3371 manually sense-annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs). To provide baselines for future studies on the Persian all-words WSD task, we evaluate several WSD models on SBU-WSD-Corpus. The corpus is publicly available at https://github.com/hrouhizadeh/SBU-WSD-Corpus.
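As a quick consistency check on the figures above, the following minimal Python sketch sums the per-part-of-speech annotation counts and derives each category's share. The counts are taken directly from the abstract; the variable names are illustrative only and do not come from the corpus or its tooling.

```python
# Counts reported in the abstract for SBU-WSD-Corpus.
# Variable names are hypothetical and chosen for illustration.
content_words = 5892  # content words in the running text
annotated_by_pos = {
    "noun": 2073,
    "verb": 566,
    "adjective": 610,
    "adverb": 122,
}

# The per-POS counts should add up to the reported 3371 annotated words.
total_annotated = sum(annotated_by_pos.values())

# Share of each part of speech among the annotated words.
pos_share = {pos: n / total_annotated for pos, n in annotated_by_pos.items()}

# Fraction of content words that carry a manual sense annotation.
annotated_fraction = total_annotated / content_words

print(f"annotated words: {total_annotated}")
for pos, share in pos_share.items():
    print(f"{pos:>9}: {share:.1%}")
print(f"annotated / content words: {annotated_fraction:.1%}")
```

The per-category counts add up to the reported total of 3371, so the breakdown in the abstract is internally consistent.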