Domain Adaptation (DA) of Neural Machine Translation (NMT) models often relies on a pre-trained general NMT model that is adapted to the new domain on a sample of in-domain parallel data. Without parallel data, there is no way to estimate the potential benefit of DA, nor the amount of parallel samples it would require. This is nevertheless a desirable capability that could help MT practitioners make an informed decision before investing resources in dataset creation. We propose a Domain adaptation Learning Curve prediction (DaLC) model that predicts prospective DA performance based on in-domain monolingual samples in the source language. Our model relies on the NMT encoder representations combined with various instance- and corpus-level features. We demonstrate that instance-level models are better able to distinguish between different domains than the corpus-level frameworks proposed in previous studies. Finally, we perform in-depth analyses of the results, highlighting the limitations of our approach, and provide directions for future research.