This paper addresses the quality issues in existing Twitter-based paraphrase datasets and discusses the necessity of using two separate definitions of paraphrase for identification and generation tasks. We present a new Multi-Topic Paraphrase in Twitter (MultiPIT) corpus that consists of a total of 130k sentence pairs with crowdsourced (MultiPIT_crowd) and expert (MultiPIT_expert) annotations, using two different paraphrase definitions for paraphrase identification, in addition to a multi-reference test set (MultiPIT_NMR) and a large automatically constructed training set (MultiPIT_Auto) for paraphrase generation. With improved annotation quality and task-specific paraphrase definitions, the best pre-trained language model fine-tuned on our dataset achieves state-of-the-art performance of 84.2 F1 for automatic paraphrase identification. Furthermore, our empirical results demonstrate that paraphrase generation models trained on MultiPIT_Auto generate more diverse and higher-quality paraphrases than their counterparts fine-tuned on other corpora such as Quora, MSCOCO, and ParaNMT.