Paraphrase Generation as Unsupervised Machine Translation

In this paper, we propose a new paradigm for paraphrase generation that treats the task as unsupervised machine translation (UMT), based on the assumption that a large-scale unlabeled monolingual corpus must contain pairs of sentences expressing the same meaning. The proposed paradigm first splits a large unlabeled corpus into multiple clusters and trains multiple UMT models using pairs of these clusters. Then, based on the paraphrase pairs produced by these UMT models, a unified surrogate model is trained to serve as the final Seq2Seq model for generating paraphrases; it can be used directly at test time in the unsupervised setup or finetuned on labeled datasets in the supervised setup. The proposed method has advantages over machine-translation-based paraphrase generation methods, as it avoids reliance on bilingual sentence pairs. It also allows humans to intervene in the model so that more diverse paraphrases can be generated using different filtering criteria. Extensive experiments on existing paraphrase datasets in both the supervised and unsupervised setups demonstrate the effectiveness of the proposed paradigm.
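The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cluster_pairs`, `jaccard`, `filter_pairs`), the use of unordered cluster pairs, and the token-overlap threshold are all assumptions made for illustration, and the clustering and UMT training steps themselves are omitted. The filter stands in for one possible "filtering criterion": it drops candidate pairs in which the output is nearly a copy of the input.

```python
from itertools import combinations


def cluster_pairs(num_clusters):
    """Enumerate unordered pairs of cluster indices.

    Assumption: one UMT model would be trained per pair of clusters;
    whether pairs are ordered is not stated in the abstract.
    """
    return list(combinations(range(num_clusters), 2))


def jaccard(a, b):
    """Token-level Jaccard overlap of two whitespace-tokenized sentences."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_pairs(pairs, max_overlap=0.8):
    """Keep candidate paraphrase pairs whose overlap is at most max_overlap.

    Hypothetical criterion: a lower threshold favors more diverse
    paraphrases, a higher one keeps pairs closer to the source sentence.
    """
    return [(src, tgt) for src, tgt in pairs if jaccard(src, tgt) <= max_overlap]


if __name__ == "__main__":
    print(cluster_pairs(3))
    candidates = [
        ("the cat sat", "the cat sat"),
        ("the cat sat", "a feline was seated"),
    ]
    print(filter_pairs(candidates))
```

In this sketch, the pairs surviving the filter would form the training data for the single surrogate Seq2Seq model; changing `max_overlap` is one way a person could steer how diverse the resulting paraphrases are.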

