A sememe is defined as the minimum semantic unit of human languages. Sememe knowledge bases (KBs), which contain words annotated with sememes, have been successfully applied to many NLP tasks. However, existing sememe KBs cover only a few languages, which hinders their widespread use. To address this issue, we propose building a unified sememe KB for multiple languages based on BabelNet, a multilingual encyclopedic dictionary. We first build a dataset that serves as the seed of the multilingual sememe KB; it contains manual sememe annotations for over 15 thousand synsets (the entries of BabelNet). Then, we present a novel task of automatic sememe prediction for synsets, which aims to expand the seed dataset into a usable KB. We also propose two simple and effective models that exploit different information about synsets. Finally, we conduct quantitative and qualitative analyses to explore important factors and difficulties in the task. All the source code and data of this work are available at https://github.com/thunlp/BabelNet-Sememe-Prediction.