Language modelling and machine translation systems mostly take subwords or characters as input, whereas syllables are seldom used. Syllables yield shorter sequences than characters, require less specialised extraction rules than morphemes, and their segmentation is not affected by corpus size. In this study, we first explore the potential of syllables for open-vocabulary language modelling in 21 languages. We use rule-based syllabification methods for six languages and address the rest with hyphenation, which serves as a proxy for syllabification. With a comparable perplexity measure, we show that syllables outperform characters and other subwords. Moreover, we study the importance of syllables for neural machine translation on an unrelated, low-resource language pair (Spanish–Shipibo-Konibo). In pairwise and multilingual systems, syllables outperform unsupervised subwords, as well as further morphological segmentation methods, when translating into a highly synthetic language with a transparent orthography (Shipibo-Konibo). Finally, we conduct a human evaluation and discuss limitations and opportunities.
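As a rough illustration of why syllable segmentation gives shorter sequences than character segmentation while staying open-vocabulary, here is a minimal sketch. It is not the rule-based syllabifiers or hyphenation tools used in the study: the function names `toy_syllabify` and `sequence_lengths`, the five-letter vowel set, and the consonant-splitting rule are all hypothetical simplifications chosen for illustration.

```python
import re

# Assumption: a toy vowel inventory; real syllabifiers are language-specific.
VOWELS = "aeiou"


def toy_syllabify(word):
    """Split a lowercase word into rough syllables (hypothetical toy rule).

    Each run of vowels is treated as one nucleus. Between two nuclei, a single
    consonant becomes the onset of the next syllable; in a longer cluster, the
    first consonant closes the current syllable and the rest open the next.
    """
    nuclei = [m.span() for m in re.finditer(f"[{VOWELS}]+", word)]
    if len(nuclei) <= 1:
        return [word]

    syllables = []
    start = 0
    for (_, end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster_len = next_start - end  # consonants between the two nuclei
        cut = end if cluster_len <= 1 else end + 1
        syllables.append(word[start:cut])
        start = cut
    syllables.append(word[start:])
    return syllables


def sequence_lengths(sentence):
    """Return (number of character tokens, number of syllable tokens)."""
    words = sentence.lower().split()
    n_chars = sum(len(w) for w in words)
    n_sylls = sum(len(toy_syllabify(w)) for w in words)
    return n_chars, n_sylls


print(toy_syllabify("translation"))            # ['tran', 'sla', 'tion']
print(sequence_lengths("banana translation"))  # (17, 6)
```

Because the syllables of a word always concatenate back to the original string, any unseen word can still be segmented, which is what keeps the vocabulary open. The toy rule treats accented vowels as consonants and ignores language-specific onset clusters, so it will mis-split many real words.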