State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that differ from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source–variety) data. This also includes adaptation of MT systems to low-resource, typologically related target languages. We experiment with adapting an English–Russian MT system to generate Ukrainian and Belarusian, an English–Norwegian Bokmål system to generate Nynorsk, and an English–Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.