Searching for a higher power in the human evaluation of MT

In MT evaluation, pairwise comparisons are conducted to identify the better system. To run such a comparison, the experimenter must allocate a budget for collecting Direct Assessment (DA) judgments. We provide a cost-effective way to spend the budget, but show that typical budget sizes often do not allow for solid comparison. Taking the perspective that solid comparison rests on achieving statistical significance, we study the power (the rate of achieving significance) on a large collection of pairwise DA comparisons. Due to the nature of statistical estimation, power is low for differentiating systems that are less than 1-2 DA points apart, and achieving a notable increase in power requires at least 2-3x more samples. Applying variance reduction alone will not yield these gains, so we must face the reality of undetectable differences and spending increases. In this context, we propose interim testing, an "early stopping" collection procedure that yields more power per judgment collected by adaptively focusing the budget on pairs that are borderline significant. Interim testing can achieve up to a 27% efficiency gain when spending 3x the current budget, or 18% savings at the current evaluation power.
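To make the two central ideas concrete, the following is a minimal sketch, not the procedure evaluated in this work. It assumes a normal approximation, equal per-judgment standard deviation for both systems, and a conservative Bonferroni split of the significance level across interim looks. The names `power_two_sample` and `interim_test`, the standard deviation of 25 DA points, and the look schedule are all hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def power_two_sample(diff, sigma, n, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two system means.

    diff  : true difference in mean DA score (assumed known here)
    sigma : per-judgment standard deviation (assumed equal for both systems)
    n     : number of judgments collected per system
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between two independent sample means.
    shift = diff / (sigma * sqrt(2.0 / n))
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


def interim_test(a_scores, b_scores, looks, alpha=0.05):
    """Test for significance at each look and stop as soon as it is reached.

    looks : increasing judgment counts per system at which to test.
    The level alpha is split evenly across looks (Bonferroni), an assumed,
    conservative way to keep the overall false-positive rate at most alpha.
    Returns (significant, judgments_used_per_system).
    """
    z_crit = NormalDist().inv_cdf(1 - (alpha / len(looks)) / 2)
    for n in looks:
        a, b = a_scores[:n], b_scores[:n]
        std_err = sqrt(variance(a) / n + variance(b) / n)
        if std_err > 0 and abs(mean(a) - mean(b)) / std_err > z_crit:
            return True, n  # clear winner: stop collecting for this pair
    return False, looks[-1]  # budget exhausted without significance


if __name__ == "__main__":
    # Power for a 1-point difference as the budget grows (sigma is assumed).
    for n in (1500, 3000, 4500):
        print(n, round(power_two_sample(diff=1.0, sigma=25.0, n=n), 3))
```

Under these assumptions, power for a small difference grows slowly with the number of judgments, because the standard error shrinks only with the square root of the sample size. The early-stopping loop spends the full budget only on pairs that have not yet reached significance.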
