Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers employ a variety of heuristic techniques, including searching for the conditional mode (vs. sampling) and incorporating various training heuristics (e.g., label smoothing). While search strategies significantly improve BLEU score, they yield deterministic outputs that lack the diversity of human translations. Moreover, search tends to bias the distribution of translated gender pronouns. This makes human-level BLEU a misleading benchmark in that modern MT systems cannot approach human-level BLEU while simultaneously maintaining human-level translation diversity. In this paper, we characterize distributional differences between generated and real translations, examining the cost in diversity paid for the BLEU scores enjoyed by NMT. Furthermore, our study implicates search as a salient source of known bias when translating gender pronouns.
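To make the central contrast concrete, here is a minimal Python sketch (not from the paper; the 60/40 pronoun probabilities and the toy single-token setting are hypothetical) illustrating how mode-seeking search collapses output diversity and skews pronoun frequencies, while sampling from the same conditional distribution preserves its spread.

```python
import random
from collections import Counter

# Hypothetical conditional distribution over translations of an
# ambiguous source pronoun: the model puts 60% mass on "he" and
# 40% on "she". Real NMT distributions are over full sequences;
# a single token suffices to show the effect.
P = {"he": 0.60, "she": 0.40}

def mode_decode(dist):
    """Mode-seeking search (greedy, i.e. beam size 1): always the argmax."""
    return max(dist, key=dist.get)

def sample_decode(dist, rng):
    """Ancestral sampling: draw from the model's conditional distribution."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

rng = random.Random(0)
n = 10_000
search_outputs = Counter(mode_decode(P) for _ in range(n))
sampled_outputs = Counter(sample_decode(P, rng) for _ in range(n))

print("search:  ", search_outputs)   # 100% "he": deterministic output,
                                     # pronoun distribution fully skewed
print("sampling:", sampled_outputs)  # roughly 60/40, matching the model
```

Under this toy model, search turns a 60/40 preference into a 100/0 output distribution, which mirrors the abstract's claim that search both eliminates translation diversity and amplifies bias in translated gender pronouns.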