Several studies have examined the correlation between human ratings and automatic metrics such as BLEU, chrF2, and COMET in machine translation. Most, if not all, consider full-sentence translation. It remains unclear whether Continuous Rating (CR), the human rating method used for simultaneous speech translation, correlates with these metrics. We therefore conduct an extensive correlation analysis of CR and the aforementioned automatic metrics on evaluations of candidate systems from the English-German simultaneous speech translation task at IWSLT 2022. Our analysis reveals that the offline MT metrics correlate with CR and can be reliably used for evaluating machine translation in simultaneous mode, with some limitations on the test set size. This implies that automatic metrics can serve as proxies for CR, thereby alleviating the need for human evaluation.
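To make the setup concrete, the following minimal Python sketch illustrates the kind of system-level correlation analysis described above. The data, system names, and CR values are hypothetical placeholders; sacrebleu is used for BLEU and chrF2 (sacrebleu's default chrF configuration uses beta=2), and scipy provides the correlation coefficients. COMET would be computed analogously with the unbabel-comet package and correlated in the same way.

```python
# Hypothetical sketch: correlate per-system Continuous Rating (CR) averages
# with corpus-level automatic metric scores for the same candidate systems.
import sacrebleu
from scipy.stats import pearsonr, spearmanr

# Placeholder data: one entry per candidate system (names and outputs invented).
system_outputs = {
    "systemA": ["Das ist ein Test.", "Noch ein Satz."],
    "systemB": ["Dies ist ein Versuch.", "Ein weiterer Satz."],
    "systemC": ["Das ist Test.", "Noch Satz."],
}
references = [["Das ist ein Test.", "Noch ein Satz."]]  # single reference set
cr_scores = {"systemA": 3.8, "systemB": 3.2, "systemC": 2.5}  # averaged CR per system

human, bleu, chrf = [], [], []
for name, hyps in system_outputs.items():
    human.append(cr_scores[name])
    bleu.append(sacrebleu.corpus_bleu(hyps, references).score)
    # Default corpus_chrf (char_order=6, beta=2) corresponds to chrF2.
    chrf.append(sacrebleu.corpus_chrf(hyps, references).score)

# System-level correlation between CR and each automatic metric.
print("Pearson  CR vs BLEU :", pearsonr(human, bleu))
print("Spearman CR vs chrF2:", spearmanr(human, chrf))
```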