Traditional automatic evaluation metrics for machine translation have been widely criticized by linguists due to their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluations in the form of MQM-like scorecards have always been carried out in real industry settings by both clients and translation service providers (TSPs). However, traditional human translation quality evaluations are costly to perform, go into great linguistic detail, raise issues of inter-rater reliability (IRR), and are not designed to measure the quality of translations below premium quality. In this work, we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types, and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting the error severity level assigned to each translation unit. Initial experimental work carried out on English-Russian MT outputs of marketing content from a highly technical domain reveals that our evaluation framework is quite effective in reflecting MT output quality, in terms of both overall system-level performance and segment-level transparency, and that it increases the IRR for error type interpretation. The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labour effort required to bring MT output to premium quality, lower cost and faster application, as well as higher IRR. Our experimental data is available at \url{https://github.com/lHan87/HOPE}.
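To illustrate the kind of scoring model mentioned above, the following is a minimal sketch in Python. The base of the geometric progression, the five-level severity scale, and the function names are assumptions made for illustration only, not the framework's exact constants.

\begin{verbatim}
# Minimal sketch of a geometric-progression penalty scheme.
# Assumption: severity levels 1 (minor) .. 5 (critical) map to
# EPPs 1, 2, 4, 8, 16 (base 2); the actual constants may differ.

def error_penalty_points(severity: int, base: int = 2) -> int:
    """Map an error severity level to error penalty points (EPPs)."""
    return base ** (severity - 1)

def segment_score(error_severities: list) -> int:
    """Total EPPs for one translation unit: the sum of the
    penalties of all errors annotated in that segment."""
    return sum(error_penalty_points(s) for s in error_severities)

# Example: a segment with one minor (1) and one severe (4) error
print(segment_score([1, 4]))  # 1 + 8 = 9 EPPs
\end{verbatim}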