Unlike English, morphologically rich languages can reveal characteristics of speakers or their conversational partners, such as gender and number, via pronouns, morphological endings of words and syntax. When translating from English to such languages, a machine translation model needs to opt for a certain interpretation of textual context, which may lead to serious translation errors if extra-textual information is unavailable. We investigate this challenge in the English-to-Polish language direction. We focus on the underresearched problem of utilising external metadata in automatic translation of TV dialogue, proposing a case study where a wide range of approaches for controlling attributes in translation is employed in a multi-attribute scenario. The best model achieves an improvement of +5.81 chrF++/+6.03 BLEU, with other models achieving competitive performance. We additionally contribute a novel attribute-annotated dataset of Polish TV dialogue and a morphological analysis script used to evaluate attribute control in models.