Natural language understanding and generation models follow one of two dominant architectural paradigms: language models (LMs), which process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec), which use separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, and few studies have investigated the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual, and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) different LMs have different scaling properties: architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases; (ii) several design choices, including causal masking and language-modeling objectives on the source sequence, have detrimental effects on translation quality; and (iii) when paired with full-visible masking over source sequences, LMs can perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by reducing off-target translations.
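The contrast between causal masking and full-visible masking over the source sequence can be made concrete with the attention masks themselves. Below is a minimal illustrative sketch (not from the paper; function names are hypothetical): a fully causal mask lets each position attend only to earlier positions, while a prefix-LM mask grants bidirectional visibility within the source prefix and keeps the target causal.

```python
def causal_mask(seq_len):
    # Fully causal: position i may attend only to positions j <= i.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def prefix_lm_mask(src_len, tgt_len):
    # Full-visible masking over the source prefix: every position may attend
    # to the entire source; target positions remain causal among themselves.
    total = src_len + tgt_len
    return [[j < src_len or j <= i for j in range(total)]
            for i in range(total)]
```

With `src_len=3, tgt_len=2`, source position 0 can attend to source position 2 under the prefix-LM mask but not under the causal mask, while target positions still cannot attend to future target tokens.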