Modern neural machine translation (NMT) models have achieved competitive performance on standard benchmarks. However, they have recently been shown to have limited compositional generalization ability: they fail to effectively learn the translation of atoms (e.g., words) and their semantic composition (e.g., modification) from seen compounds (e.g., phrases), and consequently their translation performance degrades significantly on unseen compounds at inference time. We address this issue by introducing categorization into the source contextualized representations. The main idea is to improve generalization by reducing sparsity and overfitting, which is achieved by finding prototypes of token representations over the training set and integrating their embeddings into the source encoding. Experiments on a dedicated compositional generalization MT dataset (i.e., CoGnition) show that our method reduces the compositional generalization error rate by 24\%. In addition, our conceptually simple method consistently outperforms the Transformer baseline on a range of general MT datasets.
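A minimal sketch of the prototype-based categorization idea described above, not the authors' implementation: it assumes contextualized token vectors are collected from a trained encoder over the training set, clusters them with k-means to obtain prototypes, and mixes each source token encoding with its nearest prototype. The function names, the cluster count, and the mixing weight `alpha` are illustrative assumptions.

```python
# Illustrative sketch of categorization via prototypes (assumed design, not the
# paper's exact method): k-means over contextualized token vectors yields
# prototype embeddings, which are then blended into the source encoding.
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(token_vectors: np.ndarray, num_prototypes: int = 256) -> np.ndarray:
    """Cluster contextualized token representations into prototype vectors."""
    kmeans = KMeans(n_clusters=num_prototypes, n_init=10, random_state=0)
    kmeans.fit(token_vectors)                   # token_vectors: (num_tokens, dim)
    return kmeans.cluster_centers_              # (num_prototypes, dim)

def categorize(encodings: np.ndarray, prototypes: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Mix each token encoding with its nearest prototype embedding."""
    # dists: (seq_len, num_prototypes) Euclidean distances to every prototype
    dists = np.linalg.norm(encodings[:, None, :] - prototypes[None, :, :], axis=-1)
    nearest = prototypes[dists.argmin(axis=1)]  # nearest prototype per token
    return (1.0 - alpha) * encodings + alpha * nearest

# Usage: collect (num_tokens, dim) encoder outputs over the training set, call
# build_prototypes once, then apply categorize to each source sentence encoding.
```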