Theoretical work in morphological typology offers the possibility of measuring morphological diversity on a continuous scale. However, the Natural Language Processing (NLP) literature typically labels a whole language with a single, strict morphological type, e.g., fusional or agglutinative. In this work, we propose to reduce the rigidity of such claims by quantifying morphological typology at the word and segment level. We follow the approach of Payne (2017), which classifies morphology using two indices: synthesis (from analytic to polysynthetic) and fusion (from agglutinative to fusional). To compute synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish; to compute fusion, we propose a semi-automatic method using Spanish as a case study. We then analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish, and verbs for English-Spanish) and at the segment level (the same language pairs plus English-German, in both directions). We complement the word-level analysis with human evaluation and, overall, observe a consistent impact of both indices on machine translation quality.
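As a rough illustration of what a word-level synthesis score can look like, the sketch below counts morphemes per word over words that have already been segmented, and averages them. It assumes the common reading of the synthesis index as morphemes per word; the function names, the `+` separator and the toy segmentations are hypothetical and are not taken from this work, whose exact computation is not specified here.

```python
from statistics import mean


def synthesis_per_word(segmented_word, sep="+"):
    """Count the morphemes in one pre-segmented word, e.g. 'ev+ler+de' -> 3."""
    return len([m for m in segmented_word.split(sep) if m])


def synthesis_index(segmented_words, sep="+"):
    """Average number of morphemes per word; 0.0 for an empty list."""
    counts = [synthesis_per_word(w, sep) for w in segmented_words]
    return mean(counts) if counts else 0.0


# Toy, hand-segmented examples (illustrative only, not data from the paper).
english = ["the", "house+s", "walk+ed"]
turkish = ["ev+ler+de", "gel+iyor+um"]

print(synthesis_index(english))
print(synthesis_index(turkish))
```

On these toy inputs the Turkish words score higher than the English ones, which is the kind of gradient a continuous synthesis scale is meant to capture. In practice the segmentations would come from an unsupervised or supervised segmenter, so the score inherits that segmenter's errors.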