While quality estimation (QE) can play an important role in the translation process, its effectiveness relies on the availability and quality of training data. For QE in particular, high-quality labeled data is often scarce due to the high cost and effort of annotation. Beyond data scarcity, QE models should also generalize, i.e., they should be able to handle data from different domains, both generic and specific. To alleviate these two main issues -- data scarcity and domain mismatch -- this paper combines domain adaptation and data augmentation within a robust QE system. Our method first trains a generic QE model and then fine-tunes it on a specific domain while retaining generic knowledge. Our results show significant improvements for all the language pairs investigated, better cross-lingual inference, and superior performance in zero-shot learning scenarios compared with state-of-the-art baselines.
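The two-phase strategy above (generic training followed by domain fine-tuning that retains generic knowledge) can be illustrated with a minimal sketch. This is not the paper's implementation: it uses a toy one-feature linear regressor, synthetic data, and two hypothetical retention choices -- a smaller fine-tuning learning rate and mixing a few generic examples back into the domain set -- purely to show the training schedule.

```python
import random

def train(weights, data, lr, epochs=50):
    """SGD on a toy one-feature linear QE regressor: score = w*x + b."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y  # squared-error gradient
            w -= lr * err * x
            b -= lr * err
    return w, b

random.seed(0)
# Hypothetical generic QE data: feature -> quality score (relation 0.5x + 1.0)
generic = [(x, 0.5 * x + 1.0) for x in [random.uniform(0, 2) for _ in range(40)]]
# Hypothetical in-domain data with a shifted relation (0.7x + 0.8)
domain = [(x, 0.7 * x + 0.8) for x in [random.uniform(0, 2) for _ in range(10)]]

# Phase 1: train a generic QE model from scratch.
w, b = train((0.0, 0.0), generic, lr=0.1)

# Phase 2: fine-tune on the domain with a smaller learning rate,
# mixing in a few generic examples so generic knowledge is retained.
mixed = domain + random.sample(generic, 5)
w_ft, b_ft = train((w, b), mixed, lr=0.02)
```

After fine-tuning, the adapted model should fit the in-domain relation better than the generic model while staying close to the generic solution; in a real QE system the regressor would be a pretrained multilingual encoder and the retention mechanism could instead be layer freezing or regularization toward the generic weights.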