Paraphrase Identification with Deep Learning: A Review of Datasets and Methods

The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. If these tools are used for plagiarism, they can pose a serious threat to the credibility of various forms of media, including scientific literature and news sources. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI.



