Machine Translation systems can produce different types of errors, some of which are characterized as critical or catastrophic because of the specific negative impact they can have on users. Automatic or human evaluation metrics do not necessarily differentiate between such critical errors and more innocuous ones. In this paper we focus on one type of critical error: added toxicity. We evaluate and analyze added toxicity when translating a large evaluation dataset (HOLISTICBIAS, over 472k sentences, covering 13 demographic axes) from English into 164 languages. Automatic toxicity evaluation shows that added toxicity across languages varies from 0% to 5%. The output languages with the most added toxicity tend to be low-resource ones, and the demographic axes with the most added toxicity include sexual orientation, gender and sex, and ability. We also perform human evaluation on a subset of 8 directions, confirming the prevalence of true added toxicity. To interpret what causes toxicity, we measure the amount of source contribution to the translation, where a low source contribution implies hallucination. We observe that the source contribution is somewhat correlated with toxicity but that 45.6% of added toxic words have a high source contribution, suggesting that much of the added toxicity may be due to mistranslations. Combining the signal of source contribution level with a measurement of translation robustness allows us to flag 22.3% of added toxicity, suggesting that added toxicity may be related to both hallucination and the stability of translations in different contexts. Given these findings, our recommendations to reduce added toxicity are to curate training data to avoid mistranslations, mitigate hallucination, and check unstable translations.
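As a rough illustration of the quantities discussed above, the following minimal sketch counts a translation as having added toxicity when the output contains a toxic term while the source does not, and flags a translation for review when its source contribution is low or the translation is unstable. Everything here is an assumption made for illustration rather than the procedure used in the paper: the word-list matching, the `Translation` fields, the placeholder lexicon entries, the threshold of 0.4, and the "low contribution or unstable" combination rule are all hypothetical.

```python
from dataclasses import dataclass

PUNCTUATION = ".,!?;:\"'"


@dataclass
class Translation:
    source: str                 # English source sentence
    target: str                 # system output in the target language
    source_contribution: float  # hypothetical score in [0, 1]; low values suggest hallucination
    is_stable: bool             # hypothetical robustness signal for this translation


def toxic_words(sentence, lexicon):
    """Return the lexicon entries found among the whitespace-separated tokens."""
    tokens = {tok.strip(PUNCTUATION).lower() for tok in sentence.split()}
    return tokens & lexicon


def has_added_toxicity(item, source_lexicon, target_lexicon):
    """True when the target contains a toxic term and the source contains none."""
    target_hit = bool(toxic_words(item.target, target_lexicon))
    source_hit = bool(toxic_words(item.source, source_lexicon))
    return target_hit and not source_hit


def added_toxicity_rate(items, source_lexicon, target_lexicon):
    """Fraction of translations with added toxicity (0.0 for an empty list)."""
    if not items:
        return 0.0
    hits = sum(has_added_toxicity(it, source_lexicon, target_lexicon) for it in items)
    return hits / len(items)


def should_review(item, contribution_threshold=0.4):
    """Assumed rule: flag when source contribution is low or the translation is unstable."""
    return item.source_contribution < contribution_threshold or not item.is_stable


if __name__ == "__main__":
    # Placeholder lexicons; real word lists would be language-specific.
    source_lexicon = {"badword"}
    target_lexicon = {"malapalabra"}
    batch = [
        Translation("I am a parent.", "Soy malapalabra.", 0.2, False),
        Translation("I am a parent.", "Soy padre.", 0.8, True),
    ]
    print(added_toxicity_rate(batch, source_lexicon, target_lexicon))
    print([should_review(it) for it in batch])
```

Token-level matching of this kind is only a proxy: it misses multi-word expressions and languages without whitespace word boundaries, which is one reason human evaluation remains useful for confirming true added toxicity.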