We describe ParaNMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-English side of a large parallel corpus, following Wieting et al. (2017). Our hope is that ParaNMT-50M can be a valuable resource for paraphrase generation and can provide a rich source of semantic knowledge to improve downstream natural language understanding tasks. To show its utility, we use ParaNMT-50M to train paraphrastic sentence embeddings that outperform all supervised systems on every SemEval semantic textual similarity competition, and we show how it can be used for paraphrase generation.
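As a rough illustration of the generation procedure described in the abstract, the minimal sketch below pairs each English reference sentence with an English translation of its non-English counterpart. The function name `build_paraphrase_pairs`, the `translate_to_english` callable, and the toy lookup-table translator are all hypothetical stand-ins introduced here; they are not the authors' code, and a real pipeline would call a trained neural machine translation system instead.

```python
from typing import Callable, Iterable, List, Tuple


def build_paraphrase_pairs(
    parallel_corpus: Iterable[Tuple[str, str]],
    translate_to_english: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each English reference with a translation of its foreign side.

    parallel_corpus yields (english_reference, foreign_sentence) tuples.
    translate_to_english is an assumed stand-in for an NMT system.
    """
    pairs = []
    for english_reference, foreign_sentence in parallel_corpus:
        # Translating the non-English side yields a second English
        # sentence that should express the same meaning as the reference.
        back_translation = translate_to_english(foreign_sentence)
        pairs.append((english_reference, back_translation))
    return pairs


# Toy example: a lookup table plays the role of the translation model.
toy_translations = {"bonjour le monde": "hi world"}
toy_corpus = [("hello world", "bonjour le monde")]
toy_pairs = build_paraphrase_pairs(toy_corpus, lambda s: toy_translations[s])
```

In this sketch each resulting tuple is one English-English paraphrase pair; any filtering or quality scoring of the pairs is left out because the abstract does not describe it.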