Automatic text generation based on neural language models has reached a level of performance at which generated text is almost indistinguishable from text written by humans. While text generation is valuable in many applications, it can also be employed for malicious purposes, and the spread of such practices represents a threat to the quality of academic publishing. To address this problem, we propose two datasets of artificially generated research content: a fully synthetic dataset and a partial text-substitution dataset. In the first case, the content is generated entirely by the GPT-2 model from a short prompt extracted from original papers. The partial, or hybrid, dataset is created by replacing several sentences of each abstract with sentences generated by the Arxiv-NLP model. We evaluate the quality of the datasets by comparing the generated texts to aligned original texts using fluency metrics such as BLEU and ROUGE; the more natural the artificial texts appear, the harder they are to detect and the better the benchmark. We also assess the difficulty of distinguishing original from generated text using state-of-the-art classification models.
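The following is a minimal sketch (not the authors' released code) of the two steps the abstract describes: generating a synthetic continuation with GPT-2 from a short prompt taken from an original paper, and scoring the generated text against the aligned original with BLEU and ROUGE. It assumes the Hugging Face `transformers`, `nltk`, and `rouge-score` packages; the model checkpoint, decoding parameters, and example strings are illustrative assumptions, not the paper's reported settings.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Base GPT-2 checkpoint; the paper's exact checkpoint is an assumption here.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Short prompt extracted from an original abstract (illustrative text).
prompt = "Neural language models have recently achieved"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation from the prompt; sampling hyperparameters are assumed.
output_ids = model.generate(
    **inputs,
    max_length=128,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id,
)
generated = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Aligned original text the generated passage is compared against (illustrative).
original = (
    "Neural language models have recently achieved strong results "
    "on a wide range of natural language generation tasks."
)

# BLEU: n-gram overlap of the generated text against the original reference.
bleu = sentence_bleu(
    [original.split()],
    generated.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: longest-common-subsequence overlap between original and generated text.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(original, generated)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```

Low BLEU and ROUGE against the aligned original do not by themselves make a text detectable; in this setup they serve as fluency and similarity proxies for judging how natural, and hence how challenging, the benchmark texts are.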