Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but it may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained at the subword level and at the character level translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.