We release a synthetic parallel paraphrase corpus covering 17 languages: Arabic, Catalan, Czech, German, English, Spanish, Estonian, French, Hindi, Indonesian, Italian, Dutch, Romanian, Russian, Swedish, Vietnamese, and Chinese. Our method relies only on monolingual data and a neural machine translation system to generate paraphrases, and is therefore simple to apply. We generate multiple translation samples using beam search and select the most lexically diverse pair as measured by sentence-level BLEU. We compare our generated corpus with \texttt{ParaBank2}. According to our evaluation, our synthetic paraphrase pairs are semantically similar and lexically diverse.
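The pair-selection step can be illustrated with a minimal sketch. It assumes that "most lexically diverse" means the pair of beam candidates with the lowest sentence-level BLEU against each other. The abstract does not specify the BLEU variant, tokenization, or how the two scoring directions are combined, so the add-one smoothing, whitespace tokenization, and symmetric averaging below are assumptions, and the names `sentence_bleu` and `most_diverse_pair` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence BLEU in [0, 1] on whitespace tokens.

    Assumption: add-one smoothing on every n-gram order, which is one of
    several common smoothing schemes and not necessarily the paper's.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngram_counts(hyp, n), ngram_counts(ref, n)
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = sum(hyp_ngrams.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision)


def most_diverse_pair(candidates):
    """Return the two candidates with the lowest symmetric sentence BLEU."""
    unique = list(dict.fromkeys(candidates))  # drop exact duplicates, keep order
    if len(unique) < 2:
        raise ValueError("need at least two distinct candidates")

    def symmetric_bleu(a, b):
        # BLEU is asymmetric, so average both directions (an assumption).
        return 0.5 * (sentence_bleu(a, b) + sentence_bleu(b, a))

    return min(combinations(unique, 2), key=lambda pair: symmetric_bleu(*pair))


if __name__ == "__main__":
    # Stand-ins for the translation samples returned by beam search.
    beam_outputs = [
        "the cat sat on the mat",
        "the cat sat on the mat today",
        "a feline rested upon that rug",
    ]
    print(most_diverse_pair(beam_outputs))
```

In practice the candidates would come from the n-best list of the translation system, and an established BLEU implementation would likely replace the simplified scorer used here.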