The rapid advancement of AI technology has made text generation tools such as GPT-3 and ChatGPT increasingly accessible, scalable, and effective. If these technologies are used for plagiarism, they can pose a serious threat to the credibility of various forms of media, including scientific literature and news sources. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how the under-representation of certain paraphrase types affects detection capabilities. Finally, we outline new directions for future research and datasets in pursuit of more effective AI-based paraphrase detection.