In this paper, we propose a new paradigm for paraphrase generation by treating the task as unsupervised machine translation (UMT), based on the assumption that a large-scale unlabeled monolingual corpus must contain pairs of sentences expressing the same meaning. The proposed paradigm first splits the large unlabeled corpus into multiple clusters and trains multiple UMT models using pairs of these clusters. Then, based on the paraphrase pairs produced by these UMT models, a unified surrogate model is trained to serve as the final \sts model for paraphrase generation; it can be used directly for inference in the unsupervised setup, or fine-tuned on labeled datasets in the supervised setup. The proposed method offers merits over machine-translation-based paraphrase generation methods, as it avoids reliance on bilingual sentence pairs. It also allows human intervention in the model, so that more diverse paraphrases can be generated by applying different filtering criteria. Extensive experiments on existing paraphrase datasets for both the supervised and unsupervised setups demonstrate the effectiveness of the proposed paradigm.
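A minimal sketch of the described pipeline is given below, assuming hypothetical helpers: `train_umt` stands in for a real unsupervised-MT trainer, and `keep_pair` is an illustrative filtering criterion; neither is part of the paper's actual implementation.

```python
# Sketch of the proposed paradigm: cluster an unlabeled corpus, train UMT models
# on cluster pairs, filter their outputs, and pool the surviving paraphrase pairs
# as training data for a unified surrogate model.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans


def cluster_corpus(sentences, n_clusters=4):
    """Step 1: split the unlabeled monolingual corpus into clusters."""
    features = TfidfVectorizer().fit_transform(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    clusters = [[] for _ in range(n_clusters)]
    for sentence, label in zip(sentences, labels):
        clusters[label].append(sentence)
    return clusters


def train_umt(cluster_a, cluster_b):
    """Placeholder for training an unsupervised MT model between two clusters,
    treating each cluster as a separate 'language'. A real system would return
    model-generated translations; this stub just pairs sentences for illustration."""
    return list(zip(cluster_a, cluster_b))


def keep_pair(source, target):
    """Illustrative human-chosen filtering criterion (here, lexical-overlap bounds);
    varying it controls how diverse the generated paraphrases are."""
    overlap = len(set(source.split()) & set(target.split()))
    return 0 < overlap < len(source.split())


def build_surrogate_training_set(sentences, n_clusters=4):
    """Steps 2-3: train UMT models on all cluster pairs and pool the filtered
    paraphrase pairs; these pairs then train the unified surrogate model."""
    clusters = cluster_corpus(sentences, n_clusters)
    pairs = []
    for a, b in combinations(range(n_clusters), 2):
        pairs.extend(p for p in train_umt(clusters[a], clusters[b]) if keep_pair(*p))
    return pairs
```

The resulting pairs would be fed to a standard sequence-to-sequence model, which can be used as-is in the unsupervised setup or fine-tuned on labeled paraphrase data in the supervised setup.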