Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gender in MT and investigate how bias amplification might affect language in a broader sense. We hypothesize that 'algorithmic bias', i.e. an exacerbation of frequently observed patterns in combination with a loss of less frequent ones, not only reinforces the societal biases present in current datasets but could also lead to an artificially impoverished language: 'machine translationese'. We assess the linguistic richness (on a lexical and a morphological level) of translations created by different data-driven MT paradigms: phrase-based statistical MT (PB-SMT) and neural MT (NMT). Our experiments show a loss of lexical and morphological richness in the translations produced by all investigated MT paradigms for two language pairs (EN↔FR and EN↔ES).
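The lexical side of such an assessment can be made concrete with standard diversity measures. The sketch below is a minimal illustration, not the paper's exact evaluation pipeline: it computes two common lexical-richness metrics, type-token ratio and Yule's K, over naively whitespace-split tokens; the example sentences and the tokenizer are assumptions for demonstration only.

```python
# Minimal sketch of lexical-richness measures of the kind used to compare
# MT output against human reference translations. Metric choices (TTR,
# Yule's K) and the whitespace tokenizer are illustrative assumptions.
from collections import Counter

def type_token_ratio(tokens):
    """Ratio of distinct word types to total tokens (higher = richer)."""
    return len(set(tokens)) / len(tokens)

def yules_k(tokens):
    """Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2, where V_i is the
    number of types occurring i times and N the token count.
    A repetitiveness measure: lower values indicate richer vocabulary."""
    n = len(tokens)
    freqs = Counter(tokens)                  # token -> frequency
    spectrum = Counter(freqs.values())       # frequency -> number of types
    s2 = sum(i * i * v for i, v in spectrum.items())
    return 10_000 * (s2 - n) / (n * n)

# Hypothetical human vs. MT outputs: the MT version repeats 'sat',
# so it scores lower on TTR and higher on Yule's K.
human = "the cat sat on the mat while the dog dozed nearby".split()
mt    = "the cat sat on the mat while the dog sat nearby".split()
print(type_token_ratio(human), type_token_ratio(mt))  # 0.818 vs 0.727
print(yules_k(human), yules_k(mt))                    # ~495.9 vs ~661.2
```

Length-sensitive metrics such as TTR should be compared only on samples of equal token counts; length-corrected alternatives (e.g. MTLD) avoid this caveat.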