In the field of machine translation, both academia and industry show growing interest in increasingly powerful systems trained on corpora of several hundred million to several billion examples. These systems represent the state of the art. Here we defend the idea of developing, in parallel, "frugal" bilingual translation systems trained on relatively small corpora. Based on observation of a standard professional human translator, we estimate that the training corpus should comprise at most a monolingual sub-corpus of 75 million examples for the source language, a second monolingual sub-corpus of 6 million examples for the target language, and an aligned bilingual sub-corpus of 6 million bi-examples. A less desirable alternative would be an aligned bilingual corpus of 47.5 million bi-examples.