Most work in NLP assumes that it is desirable to develop solutions in the native language in question. Consequently, there is a strong trend towards building native language models, even for low-resource languages. This paper questions this development and explores the idea of simply translating the data into English, thereby enabling the use of large-scale pretrained English language models. We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages. The exception is Finnish, where we assume the gap is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument against training native language models for low-resource languages. This paper therefore strives to make a provocative but important point. As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is more effective, from both an empirical and an environmental standpoint, to translate data from low-resource languages into English than to build language models for such languages.