A Benchmark Corpus for the Detection of Automatically Generated Text in Academic Publications

Automatic text generation based on neural language models has reached a level of performance that makes generated text almost indistinguishable from text written by humans. Despite the value that text generation can have in various applications, it can also be employed for malicious tasks. The spread of such practices represents a threat to the quality of academic publishing. To address these problems, we propose in this paper two datasets composed of artificially generated research content: a completely synthetic dataset and a partial text substitution dataset. In the first case, the content is generated entirely by the GPT-2 model from a short prompt extracted from original papers. The partial, or hybrid, dataset is created by replacing several sentences of each abstract with sentences generated by the Arxiv-NLP model. We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE. The more natural the artificial texts seem, the more difficult they are to detect and the better the benchmark. We also evaluate the difficulty of distinguishing original from generated text by using state-of-the-art classification models.
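To make the comparison between a generated text and its aligned original concrete, the following is a minimal sketch of a ROUGE-1 style unigram overlap score. It is an illustration only, not the evaluation code used for the datasets: the function name `rouge1` and the lowercase whitespace tokenization are assumptions made for this example, and published ROUGE implementations typically add stemming and other preprocessing.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall and F1 (a simplified ROUGE-1).

    Assumption: tokens are lowercase whitespace-separated words.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())

    # Clipped overlap: each word counts at most as often as it
    # appears in both texts (multiset intersection).
    overlap = sum((ref_counts & cand_counts).values())

    # max(..., 1) avoids division by zero on empty input.
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    original = "the cat sat on the mat"
    generated = "the cat lay on the mat"
    print(rouge1(original, generated))
```

A higher score means the generated text shares more vocabulary with the original. A word-overlap score of this kind says nothing about whether a classifier can tell the two texts apart, which is why the abstract treats detection difficulty as a separate evaluation.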

