The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate the evaluation metrics in use, enabling better selection of the metrics that best reflect MT quality. Unfortunately, most such research focuses on high-resource languages, mainly English, and its observations may not always carry over to other languages. Indian languages, spoken by over a billion people, are linguistically very different from English, and to date there has been no systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
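As a minimal illustrative sketch (not taken from the paper), the segment-level correlation between human MQM-derived quality scores and automatic metric scores can be computed as below. The score lists are hypothetical placeholders; the paper's actual scoring scheme, data, and correlation statistics may differ.

    # Hypothetical example: correlating human MQM-derived scores with
    # automatic metric scores (e.g., COMET) at the segment level.
    from scipy.stats import pearsonr, kendalltau

    # Placeholder per-segment scores for one language and one MT system.
    mqm_scores    = [95.0, 80.0, 60.0, 100.0, 72.5]   # human MQM-derived quality scores
    metric_scores = [0.88, 0.74, 0.55, 0.93, 0.70]    # automatic metric scores (assumed scale)

    pearson_r, _ = pearsonr(mqm_scores, metric_scores)
    kendall_tau, _ = kendalltau(mqm_scores, metric_scores)

    print(f"Pearson r:   {pearson_r:.3f}")
    print(f"Kendall tau: {kendall_tau:.3f}")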