A major challenge in the field of Text Generation is evaluation: we lack a sound theory that can be leveraged to extract guidelines for evaluation campaigns. In this work, we propose a first step towards such a theory, one that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets. The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of Text Generation systems in a given setting. We showcase the application of the theory to the WMT 21 and Spot-The-Bot evaluation data and outline how it can be leveraged to improve the evaluation protocol regarding the reliability, robustness, and significance of the evaluation outcome.
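To give a concrete sense of the kind of sample-size question the theory addresses, the sketch below performs a classical two-sample power analysis under a normality assumption. This is a minimal illustration, not the paper's framework (which additionally models uncertainty sources such as metric imperfection); the function name `samples_per_system` and all numbers are illustrative assumptions.

```python
# Minimal sketch: classical power analysis for distinguishing two Text
# Generation systems by mean metric score. NOT the paper's method; it
# assumes i.i.d., approximately normal per-sample metric scores.
import math
from scipy.stats import norm

def samples_per_system(delta: float, sigma: float,
                       alpha: float = 0.05, power: float = 0.8) -> int:
    """Samples per system needed to detect a mean metric difference
    `delta` between two systems, given per-sample metric standard
    deviation `sigma`, with a two-sided test at level `alpha`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value of the test
    z_beta = norm.ppf(power)           # quantile for the target power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)

if __name__ == "__main__":
    # e.g. detecting a 0.5-point mean difference when per-sample scores
    # have a standard deviation of 5.0 (purely illustrative numbers)
    print(samples_per_system(delta=0.5, sigma=5.0))  # -> 1570
```

As the example shows, small score differences under noisy per-sample metrics quickly demand test sets of over a thousand samples, which motivates a principled treatment of evaluation uncertainty.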