While Active Learning (AL) techniques have been explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets, where only a limited number of sentences can be translated. Such situations are especially challenging and can occur for endangered languages with few human annotators, or under cost constraints that prevent labelling large amounts of data. Although AL has been shown to be helpful with large budgets, it is not enough to build high-quality translation systems in these low-resource conditions. In this work, we propose a cost-effective training procedure to increase the performance of NMT models, utilizing a small number of annotated sentences and dictionary entries. Our method leverages monolingual data with self-supervised objectives and a small-scale, inexpensive dictionary for additional supervision to initialize the NMT model before applying AL. We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions. We also present a novel AL strategy inspired by domain adaptation for NMT and show that it is effective for low budgets. It is a hybrid data-driven approach, which samples sentences that are diverse from the labelled data while also being most similar to the unlabelled data. Finally, we show that initializing the NMT model and further using our AL strategy can achieve gains of up to $13$ BLEU compared to conventional AL methods.
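The hybrid data-driven strategy is only described at a high level in the abstract; the following is a minimal sketch of one plausible realization, assuming precomputed sentence embeddings for the candidate, labelled, and unlabelled sets. The function name `hybrid_select` and the mixing weight `alpha` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def hybrid_select(cand_emb, labeled_emb, unlabeled_emb, k, alpha=0.5):
    """Pick k candidates that are diverse from the labelled data
    and similar to the unlabelled pool (hypothetical sketch).

    All *_emb arguments are 2-D arrays of sentence embeddings,
    one row per sentence.
    """
    # Normalize embeddings so dot products become cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    lab = labeled_emb / np.linalg.norm(labeled_emb, axis=1, keepdims=True)

    # Similarity to the unlabelled pool: cosine similarity to its centroid.
    centroid = unlabeled_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    sim_unlabeled = cand @ centroid

    # Diversity from the labelled data: one minus the cosine similarity
    # to the nearest labelled sentence.
    div_labeled = 1.0 - (cand @ lab.T).max(axis=1)

    # Combine both criteria; alpha is an assumed trade-off weight.
    score = alpha * sim_unlabeled + (1.0 - alpha) * div_labeled
    return np.argsort(-score)[:k]
```

A design note on this sketch: scoring against the unlabelled-pool centroid keeps the selection representative of the data the model will face, while the nearest-labelled-neighbour term discourages redundant annotation; other diversity measures (e.g. clustering-based coverage) could be substituted under the same scoring scheme.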