SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation

Generating text for scientific papers requires not only capturing the content of the given input but also, frequently, drawing on external information called context. We advance scientific text generation by proposing a new task, context-aware text generation in the scientific domain, which aims to exploit the contribution of context to the generated text. To this end, we present a novel, challenging, large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely used objects (e.g., tables, figures, algorithms) in a paper. Using state-of-the-art models, we comprehensively benchmark the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available to facilitate research on scientific text generation.
