SynSciPass:发现科学文本生成的适当用途 (SynSciPass: detecting appropriate uses of scientific text generation)

from arxiv, Accepted to SDProc 2022 Workshop @ COLING 2022 (First Place Submission on https://www.kaggle.com/competitions/detecting-generated-scientific-papers/leaderboard)

Approaches to machine generated text detection tend to focus on binary classification of human versus machine written text. In the scientific domain where publishers might use these models to examine manuscripts under submission, misclassification has the potential to cause harm to authors. Additionally, authors may appropriately use text generation models such as with the use of assistive technologies like translation tools. In this setting, a binary classification scheme might be used to flag appropriate uses of assistive text generation technology as simply machine generated which is a cause of concern. In our work, we simulate this scenario by presenting a state-of-the-art detector trained on the DAGPap22 with machine translated passages from Scielo and find that the model performs at random. Given this finding, we develop a framework for dataset development that provides a nuanced approach to detecting machine generated text by having labels for the type of technology used such as for translation or paraphrase resulting in the construction of SynSciPass. By training the same model that performed well on DAGPap22 on SynSciPass, we show that not only is the model more robust to domain shifts but also is able to uncover the type of technology used for machine generated text. Despite this, we conclude that current datasets are neither comprehensive nor realistic enough to understand how these models would perform in the wild where manuscript submissions can come from many unknown or novel distributions, how they would perform on scientific full-texts rather than small passages, and what might happen when there is a mix of appropriate and inappropriate uses of natural language generation.

翻译：机器生成的文本检测方法往往侧重于人与机器书面文本的二进制分类。在出版商可能利用这些模型来检查提交中的手稿的科学领域,错误分类有可能对作者造成损害。此外,作者可能适当使用文本生成模型,例如使用辅助技术,例如翻译工具。在这种背景下,可以使用二进制分类办法,将辅助文本生成技术的适当使用标为简单的生成机器,这令人关切。在我们的工作中,我们通过展示一个在DAGPap22上用Scielo机器翻译版本培训的最先进的检测器来模拟这一情景。在科学领域,出版商可能会使用这些模型来检查正在提交的手稿,发现该模型是随机的。鉴于这一发现,我们为数据集开发开发了一个框架,通过对诸如翻译或引文等技术类型贴标签来检测机器生成的文本,从而导致SynSciPass的构建。通过在SynSciPass上对DAGP22上表现良好的模型进行培训,我们表明,不仅对域变换的模型更为可靠,而且对于当前版本的版本的版本也无法发现,而我们所制作的机器的版本又如何正确理解。