Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and the limited human-labelled scores available. We first re-introduce the hLEPOR metric factors, followed by the Python version we developed (ported), which enables automatic tuning of the weighting parameters in the hLEPOR metric. We then present customised hLEPOR (cushLEPOR), which uses the Optuna hyper-parameter optimisation framework to fine-tune the hLEPOR weighting parameters towards better agreement with pre-trained language models (using LaBSE) on the exact MT language pairs to which cushLEPOR is deployed. We also optimise cushLEPOR towards professional human evaluation data based on the MQM and pSQM frameworks for the English-German and Chinese-English language pairs. Our experimental investigation shows that cushLEPOR boosts hLEPOR towards better agreement with PLMs such as LaBSE at much lower cost, achieves better agreement with human evaluations including MQM and pSQM scores, and yields much better performance than BLEU (data available at \url{https://github.com/poethan/cushLEPOR}). Official results show that our submissions win three language pairs: \textbf{English-German} and \textbf{Chinese-English} in the \textit{News} domain via cushLEPOR(LM), and \textbf{English-Russian} in the \textit{TED} domain via hLEPOR.
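The tuning idea above can be illustrated with a minimal pure-Python sketch: hLEPOR combines its factor scores via a weighted harmonic mean, and the weights are searched to maximise correlation with reference scores (LaBSE similarities or human judgements). Here a simple grid search stands in for Optuna's sampler, and all factor values and "human" scores are invented for illustration; this is not the paper's implementation.

```python
from itertools import product
from statistics import mean

def hlepor(factors, weights):
    # Weighted harmonic mean of the hLEPOR factor scores
    # (e.g. length penalty, position-difference penalty, precision/recall).
    return sum(weights) / sum(w / f for w, f in zip(weights, factors))

def pearson(xs, ys):
    # Pearson correlation between metric scores and reference scores.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Toy per-segment data: (factor scores, hypothetical human score).
segments = [
    ((0.90, 0.80, 0.70), 0.85),
    ((0.60, 0.90, 0.80), 0.70),
    ((0.95, 0.70, 0.90), 0.90),
    ((0.50, 0.60, 0.55), 0.40),
]

# Grid search over integer weight candidates (Optuna would sample instead).
best_r, best_w = None, None
for w in product([1, 2, 3], repeat=3):
    scores = [hlepor(f, w) for f, _ in segments]
    r = pearson(scores, [h for _, h in segments])
    if best_r is None or r > best_r:
        best_r, best_w = r, w

print(best_w, round(best_r, 3))
```

In the actual cushLEPOR setup, Optuna replaces this exhaustive grid with its samplers (e.g. TPE) and the objective is agreement with LaBSE or MQM/pSQM scores rather than a toy correlation.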