QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

Recent advances in neural machine translation have underscored its importance, and research on quality estimation (QE) has progressed steadily. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high real-world utility, manual QE data creation has several limitations: it inevitably incurs non-trivial costs because it requires translation experts, and it is hard to scale to more data and to new languages. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. It consists of three sub-datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Because each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P and QUAK-H and 6.58M for QUAK-M. In our experiments, we quantitatively analyze word-level QE results in various ways and perform statistical analysis. Moreover, we show that efficiently scaled datasets also contribute to performance improvements: we observe meaningful performance gains for QUAK-M and QUAK-P when adding data up to 1.58M.
