Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, and text mining in general. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, on the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMO, and USE) evaluated on three publicly available corpora: the Microsoft Research Paraphrase Corpus, the Clough and Stevenson corpus, and the Webis Crowd Paraphrase Corpus 2011. Through a large number of experiments, we determined the most appropriate choices for text pre-processing, hyper-parameters, sub-model selection where sub-models exist (e.g., Skipgram vs. CBOW), distance measures, and the semantic similarity/paraphrase detection threshold. Our findings, and those of other researchers who have used deep learning models, show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.
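To make the evaluation pipeline concrete, the following is a minimal sketch of threshold-based paraphrase detection with one of the listed model types (TF-IDF) and one distance measure (cosine similarity). It is an illustration under stated assumptions, not the configuration used in the experiments: the tokenizer, the smoothed IDF formula, the function names, and the default threshold of 0.5 are all hypothetical choices, and the IDF weights are computed over the two input texts only rather than over a full corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric tokens (a deliberately simple pre-processing step)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that terms
    shared by every document still receive a non-zero weight.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(text_a, text_b, threshold=0.5):
    """Return (decision, score); the threshold value is an illustrative assumption."""
    vec_a, vec_b = tfidf_vectors([text_a, text_b])
    score = cosine_similarity(vec_a, vec_b)
    return score >= threshold, score
```

Calling `is_paraphrase(a, b)` on a sentence pair returns the binary decision together with the underlying similarity score. In a setup of this kind, the threshold would normally be tuned on held-out labelled pairs, and the vectorizer would be swapped for any of the other models while keeping the same distance-and-threshold step.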