Idiomatic Expression Paraphrasing without Strong Supervision

Idiomatic expressions (IEs) play an essential role in natural language. In this paper, we study the task of idiomatic sentence paraphrasing (ISP), which aims to paraphrase a sentence containing an IE by replacing the IE with its literal paraphrase. The primary challenge for this task is the lack of large-scale corpora of idiomatic-literal parallel sentences, for which we consider two separate solutions. First, we propose an unsupervised approach to ISP that leverages an IE's contextual information and definition and does not require a parallel sentence training set. Second, we propose a weakly supervised approach that uses back-translation to jointly perform paraphrasing and generation of sentences with IEs, thereby enlarging the small-scale parallel sentence training dataset. Other significant derivatives of the study include a model that replaces a literal phrase in a sentence with an IE to generate an idiomatic expression, and a large-scale parallel dataset of idiomatic/literal sentence pairs. Compared to competitive baselines, the proposed solutions achieve relative gains of over 5.16 points in BLEU, over 8.75 points in METEOR, and over 19.57 points in SARI when the generated sentences are empirically validated on a parallel dataset using automatic and manual evaluations. We demonstrate the practical utility of ISP as a preprocessing step in En-De machine translation.
