In this paper, we propose a new paradigm for paraphrase generation that treats the task as unsupervised machine translation (UMT), based on the assumption that a large-scale unlabeled monolingual corpus must contain pairs of sentences expressing the same meaning. The proposed paradigm first splits a large unlabeled corpus into multiple clusters and trains multiple UMT models using pairs of these clusters. Then, based on the paraphrase pairs produced by these UMT models, a unified surrogate model is trained to serve as the final Seq2Seq model for generating paraphrases. This model can be used directly for testing in the unsupervised setup or finetuned on labeled datasets in the supervised setup. The proposed method has advantages over machine-translation-based paraphrase generation methods, as it avoids reliance on bilingual sentence pairs. It also allows humans to intervene in the model, so that more diverse paraphrases can be generated using different filtering criteria. Extensive experiments on existing paraphrase datasets in both the supervised and unsupervised setups demonstrate the effectiveness of the proposed paradigm.
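The data flow described above (cluster the corpus, pair up clusters, filter the generated pairs) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the clustering method, whether cluster pairs are ordered, or the filtering criteria, so the round-robin clustering, the ordered pairing, and the Jaccard-overlap filter below are all assumptions made for illustration, and every function name is hypothetical. Training the UMT models and the surrogate Seq2Seq model is out of scope here.

```python
from itertools import permutations


def split_into_clusters(sentences, num_clusters):
    """Toy stand-in for the clustering step: round-robin assignment.

    ASSUMPTION: the abstract does not say how clusters are formed; a real
    system would presumably group sentences by content rather than position.
    """
    clusters = [[] for _ in range(num_clusters)]
    for i, sentence in enumerate(sentences):
        clusters[i % num_clusters].append(sentence)
    return clusters


def cluster_pairs(num_clusters):
    """Return ordered (source, target) cluster index pairs.

    ASSUMPTION: one UMT model per ordered pair; the abstract only says
    that models are trained "using pairs of these clusters".
    """
    return list(permutations(range(num_clusters), 2))


def jaccard(a, b):
    """Token-level Jaccard similarity of two whitespace-tokenized sentences."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_pairs(candidates, max_overlap=0.8):
    """Drop identical pairs and pairs whose lexical overlap exceeds max_overlap.

    ASSUMPTION: a hypothetical filtering criterion that favors lexically
    diverse paraphrases; the abstract leaves the actual criteria open.
    """
    return [
        (src, tgt)
        for src, tgt in candidates
        if src != tgt and jaccard(src, tgt) <= max_overlap
    ]
```

In this sketch, `filter_pairs` is the point where a human could adjust the criterion, for example by lowering `max_overlap` to keep only paraphrases that differ more from their source sentence.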