COMPILING: A Benchmark Dataset for Chinese Complexity Controllable Definition Generation

The definition generation task aims to automatically generate a word's definition within a specific context. However, because datasets covering different complexity levels are lacking, the definitions produced by models tend to stay at a single complexity level. This paper proposes a novel task: generating definitions for a word with controllable complexity levels. Correspondingly, we introduce COMPILING, a dataset that provides detailed information about Chinese definitions, with each definition labeled with its complexity level. The COMPILING dataset includes 74,303 words and 106,882 definitions. To the best of our knowledge, it is the largest dataset for the Chinese definition generation task. We select various representative generation methods as baselines for this task and conduct evaluations, which illustrate that our dataset helps models generate definitions at different complexity levels. We believe that the COMPILING dataset will benefit further research in complexity controllable definition generation.


