How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology?

Data-driven subword segmentation has become the default strategy for open-vocabulary machine translation and other NLP tasks, but may not be sufficiently generic for optimal learning of non-concatenative morphology. We design a test suite to evaluate segmentation strategies on different types of morphological phenomena in a controlled, semi-synthetic setting. In our experiments, we compare how well machine translation models trained at the subword level and at the character level can translate these morphological phenomena. We find that learning to analyse and generate morphologically complex surface representations is still challenging, especially for non-concatenative morphological phenomena like reduplication or vowel harmony and for rare word stems. Based on our results, we recommend that novel text representation strategies be tested on a range of typologically diverse languages to minimise the risk of adopting a strategy that inadvertently disadvantages certain languages.
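As a rough illustration of why reduplication and rare stems may interact badly with subword vocabularies, the sketch below contrasts character-level segmentation with a greedy longest-match segmenter over a toy vocabulary. It is not the test suite described in the abstract: the stems, the vocabulary, the full-reduplication rule, and the function names are all hypothetical, and greedy longest-match is used here only as a simple stand-in for a data-driven segmenter.

```python
from typing import List, Set


def reduplicate(stem: str) -> str:
    """Toy full reduplication: repeat the whole stem (hypothetical rule)."""
    return stem + stem


def char_segment(word: str) -> List[str]:
    """Character-level segmentation: one symbol per character."""
    return list(word)


def greedy_subword_segment(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation against a fixed vocabulary.

    Falls back to single characters when no longer piece is in the vocabulary.
    """
    pieces: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Assumed toy vocabulary: it contains a "frequent" stem but not the "rare" one.
toy_vocab = {"buku", "bu", "ku"}

frequent_form = reduplicate("buku")  # stem is in the vocabulary
rare_form = reduplicate("zamo")      # stem is not in the vocabulary

print(greedy_subword_segment(frequent_form, toy_vocab))
print(greedy_subword_segment(rare_form, toy_vocab))
print(char_segment(rare_form))
```

In this toy setting the frequent stem surfaces as two identical subword units, so the copy is visible as a repeated symbol, while the rare stem degrades to a character sequence in which the model would have to discover the copy pattern itself. Character-level segmentation treats both stems uniformly, at the cost of longer sequences.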


