Although a machine translation model trained on a large in-domain parallel corpus achieves remarkable results, it still performs poorly when no in-domain data are available. This situation restricts the applicability of machine translation when the target domain's data are limited. However, there is great demand for high-quality domain-specific machine translation models in many domains. We propose a framework that efficiently and effectively collects parallel sentences in a target domain from the web with the help of crowdworkers. With the collected parallel data, we can quickly adapt a machine translation model to the target domain. Our experiments show that the proposed method can collect target-domain parallel data within a few days at a reasonable cost. We tested it on five domains, and the domain-adapted model improved the BLEU scores by up to +19.7 points, with an average gain of +7.8 points, compared to a general-purpose translation model.