This paper describes Lingua Custodia's submission to the WMT21 shared task on machine translation using terminologies. We consider three translation directions: English to French, English to Russian, and English to Chinese. We use a Transformer-based architecture as a building block and explore a method that introduces two main changes to the standard procedure to handle terminologies. The first consists in augmenting the training data so as to encourage the model to learn a copy behavior when it encounters terminology constraint terms. The second is constraint token masking, whose purpose is to ease the learning of this copy behavior and to improve model generalization. Empirical results show that our method satisfies most terminology constraints while maintaining high translation quality.
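The two changes can be illustrated with a minimal sketch. The abstract does not specify the annotation format, so everything below is an assumption made for illustration: the tag symbols, the `<mask>` token, the function name `annotate`, and the restriction to single-token source terms are all hypothetical. The sketch inlines the target-side term next to each matched source term, which is one way to expose the model to a copy behavior, and optionally replaces the source term with a mask token.

```python
from typing import Dict, List

# Hypothetical tag and mask symbols; the abstract does not specify them.
START, SEP, END, MASK = "<term>", "<sep>", "</term>", "<mask>"


def annotate(
    source_tokens: List[str],
    terminology: Dict[str, str],
    mask: bool = False,
) -> List[str]:
    """Inline target-side terms after matching source tokens.

    Simplification: only single-token source terms are matched,
    using a lowercase lookup in `terminology`.
    """
    out: List[str] = []
    for tok in source_tokens:
        target = terminology.get(tok.lower())
        if target is None:
            # Not a constraint term: keep the token unchanged.
            out.append(tok)
            continue
        # Constraint token masking: hide the source term so the model
        # cannot rely on its surface form and must copy the target term.
        src = MASK if mask else tok
        out.extend([START, src, SEP, *target.split(), END])
    return out


tokens = "The patient has a fever".split()
terms = {"fever": "fièvre"}
augmented = annotate(tokens, terms)
masked = annotate(tokens, terms, mask=True)
```

Here `augmented` keeps the source term `fever` alongside `fièvre`, while `masked` substitutes `<mask>` for it. Whether the actual system applies masking to every constraint or only to a sampled subset is not stated in the abstract.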