Corpus-Based Paraphrase Detection Experiments and Review

Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMO, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through a great number of experiments, we decided on the most appropriate approaches for text pre-processing, hyper-parameters, sub-model selection where sub-models exist (e.g., Skipgram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings and those of other researchers who have used deep learning models show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
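As a rough illustration of the kind of pipeline the abstract describes (vectorize two texts, measure their distance, compare against a threshold), the following is a minimal sketch using TF-IDF and cosine similarity. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the function names, and the threshold of 0.5 are all assumptions chosen for illustration. In the paper, such settings are selected experimentally for each model and corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed, very simple pre-processing step)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * idf[t] for t, c in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(text_a, text_b, background=(), threshold=0.5):
    """Flag a pair as a paraphrase when its similarity reaches the threshold.

    `background` is an optional collection of extra documents used only to
    estimate IDF weights; `threshold` is a hypothetical value, not one
    reported in the paper.
    """
    vectors = tfidf_vectors([text_a, text_b] + list(background))
    score = cosine_similarity(vectors[0], vectors[1])
    return score >= threshold, score
```

A call such as `is_paraphrase(sentence_1, sentence_2, background=corpus_sentences)` returns a boolean decision together with the similarity score. Swapping `tfidf_vectors` for an embedding-based encoder (for example averaged Word2Vec vectors) would leave the distance and threshold steps unchanged, which is why those two choices can be tuned separately from the model.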

