AI-generated content (AIGC) presents a considerable challenge to educators around the world. Instructors need to be able to detect text generated by large language models, either with the naked eye or with the help of tools. There is also a growing need to understand the lexical, syntactic, and stylistic features of AIGC. To address these challenges in English language teaching, we first present ArguGPT, a balanced corpus of 4,038 argumentative essays generated by 7 GPT models in response to essay prompts from three sources: (1) in-class or homework exercises, (2) TOEFL and (3) GRE writing tasks. The machine-generated essays are paired with a roughly equal number of human-written essays at three score levels, matched by essay prompt. We then hire English instructors to distinguish machine essays from human ones. Results show that when first exposed to machine-generated essays, the instructors detect them with only 61% accuracy, a figure that rises to 67% after one round of minimal self-training. Next, we perform linguistic analyses of these essays, which show that machines produce sentences with more complex syntactic structures, while human essays tend to be more lexically complex. Finally, we test existing AIGC detectors and build our own detectors using SVMs and RoBERTa. Results suggest that a RoBERTa model fine-tuned on the ArguGPT training set achieves above 90% accuracy in both essay-level and sentence-level classification. To the best of our knowledge, this is the first comprehensive analysis of argumentative essays produced by generative large language models. The machine-authored essays in ArguGPT and our models will be made publicly available at https://github.com/huhailinguist/ArguGPT
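As a rough illustration of the essay-level detector described above, the sketch below fine-tunes RoBERTa as a binary human-vs-machine classifier with the Hugging Face Transformers library. The file names, label convention, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch: fine-tune roberta-base to classify essays as human- or
# machine-written. Data files, label coding, and hyperparameters are
# hypothetical placeholders, not the paper's reported configuration.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical CSV files with columns "text" (essay) and "label" (0 = human, 1 = machine).
data = load_dataset("csv", data_files={"train": "argugpt_train.csv",
                                       "validation": "argugpt_dev.csv"})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenize(batch):
    # Essays longer than RoBERTa's 512-token limit are truncated.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="argugpt-roberta",      # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"],
                  eval_dataset=data["validation"])
trainer.train()
```

A sentence-level detector would follow the same recipe, with essays split into individual sentences before tokenization.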