Traditional machine translation (MT) metrics provide an average measure of translation quality that is insensitive to the long tail of behavioral problems in MT. Examples include mistranslation of numbers and physical units, dropped content, and hallucinations. These errors, which occur rarely and unpredictably in Neural Machine Translation (NMT), greatly undermine the reliability of state-of-the-art MT systems. Consequently, it is important to have visibility into these problems during model development. To this end, we introduce SALTED, a specifications-based framework for behavioral testing of MT models that provides fine-grained views of salient long-tail errors, permitting trustworthy visibility into previously invisible problems. At the core of our approach is the development of high-precision detectors that flag errors (or, alternatively, verify output correctness) between a source sentence and a system output. We demonstrate that such detectors can be used not only to identify salient long-tail errors in MT systems, but also for higher-recall filtering of training data, for fixing targeted errors through model fine-tuning, and for generating novel data for metamorphic testing that elicits further bugs in models.
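To make the detector concept concrete, the following is a minimal sketch of what one such source/hypothesis check might look like, using number mismatches as the target error class. The function name `number_mismatch`, the regex, and the example sentences are illustrative assumptions, not the paper's actual implementation.

```python
import re

# A hypothetical high-precision detector in the spirit of SALTED's checks:
# flag a hypothesis whose set of numerals differs from that of the source.
# The regex and comparison below are simplified for illustration; a real
# detector would also normalize decimal separators, digit groupings, etc.
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def number_mismatch(source: str, hypothesis: str) -> bool:
    """Return True if the numerals in the hypothesis do not match the source."""
    src_nums = sorted(NUM_RE.findall(source))
    hyp_nums = sorted(NUM_RE.findall(hypothesis))
    return src_nums != hyp_nums

if __name__ == "__main__":
    # The numeral "25" has been mistranslated as "52" in the hypothesis.
    print(number_mismatch("Die Lieferung wiegt 25 kg.",
                          "The shipment weighs 52 kg."))  # True
```

Because the check requires only the source and the system output, the same predicate could, in principle, be applied to score translations during evaluation, to filter parallel training data, or to select examples for targeted fine-tuning, as the abstract describes.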