The standard approach to incorporating linguistic information into neural machine translation systems consists in maintaining a separate vocabulary for each annotated feature (e.g. POS tags, dependency relation labels), embedding the features, and aggregating each feature embedding with every subword of the word it belongs to. This approach, however, cannot easily accommodate annotation schemes that are not dense for every word. We propose a method suited to such cases, showing large improvements on out-of-domain data and comparable quality on in-domain data. Experiments are performed on morphologically rich languages such as Basque and German in low-resource scenarios.
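The standard factored-input approach described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the toy vocabularies, subword splits, and dimensions are hypothetical, and concatenation is used as the aggregation operation (summation is another common choice).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabularies: one for subwords, a separate one
# for the annotated feature (here, POS tags).
subword_vocab = {"un@@": 0, "breakable": 1, "the": 2, "ice": 3}
pos_vocab = {"ADJ": 0, "DET": 1, "NOUN": 2}

d_sub, d_feat = 8, 4  # illustrative embedding sizes
E_sub = rng.normal(size=(len(subword_vocab), d_sub))
E_pos = rng.normal(size=(len(pos_vocab), d_feat))

def embed(subwords, tags):
    """Concatenate each subword embedding with the embedding of the
    POS tag of the word that subword belongs to."""
    rows = [np.concatenate([E_sub[subword_vocab[sw]],
                            E_pos[pos_vocab[t]]])
            for sw, t in zip(subwords, tags)]
    return np.stack(rows)

# "unbreakable" splits into two subwords; both inherit the word's ADJ tag,
# so the feature annotation must be dense: every subword needs a tag.
subwords = ["un@@", "breakable", "the", "ice"]
tags     = ["ADJ", "ADJ", "DET", "NOUN"]
X = embed(subwords, tags)
print(X.shape)  # (4, 12)
```

Note how each subword requires a feature value: when an annotation scheme leaves some words untagged, this scheme needs a placeholder symbol, which is the sparsity issue the abstract raises.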