Benchmarking studies in computational chemistry use reference datasets to assess the accuracy of a method through error statistics. The commonly used error statistics, such as the mean signed and mean unsigned errors, do not inform end-users about the expected amplitude of the prediction errors attached to these methods. We show that, because the distributions of model errors are neither normal nor zero-centered, these error statistics cannot be used to infer prediction error probabilities. To overcome this limitation, we advocate for the use of more informative statistics, based on the empirical cumulative distribution function of unsigned errors, namely (1) the probability that a new calculation has an absolute error below a chosen threshold, and (2) the maximal error amplitude one can expect at a chosen, high confidence level. These statistics are also shown to be well suited for benchmarking and ranking studies. Moreover, the standard error on all benchmarking statistics depends on the size of the reference dataset. Systematic publication of these standard errors would greatly help in assessing the statistical reliability of benchmarking conclusions.
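To make the two advocated statistics concrete, here is a minimal sketch in Python of how they can be estimated from a set of benchmark errors. The synthetic error sample, the threshold value, and the helper names (`prob_below`, `q_level`, `bootstrap_se`) are illustrative assumptions, not part of the article; the nonparametric bootstrap is used here as one plausible way to obtain the standard errors mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical signed errors (method - reference) for a benchmark set,
# e.g. in kcal/mol; replace with real benchmark data.
errors = rng.normal(loc=0.5, scale=2.0, size=200)
abs_err = np.abs(errors)

def prob_below(abs_err, eta):
    """Statistic (1): empirical probability P(|error| < eta),
    i.e. the ECDF of unsigned errors evaluated at the threshold eta."""
    return np.mean(abs_err < eta)

def q_level(abs_err, p=0.95):
    """Statistic (2): empirical p-quantile of unsigned errors,
    the error amplitude not exceeded with confidence p (e.g. Q95)."""
    return np.quantile(abs_err, p)

def bootstrap_se(stat, abs_err, n_boot=1000):
    """Standard error of a statistic by nonparametric bootstrap;
    as noted above, it shrinks as the reference dataset grows."""
    n = len(abs_err)
    samples = [stat(rng.choice(abs_err, size=n, replace=True))
               for _ in range(n_boot)]
    return np.std(samples, ddof=1)

eta = 1.0  # chosen accuracy threshold, in the units of the errors
print(f"P(|error| < {eta}) = {prob_below(abs_err, eta):.2f}"
      f" +/- {bootstrap_se(lambda x: prob_below(x, eta), abs_err):.2f}")
print(f"Q95 = {q_level(abs_err):.2f}"
      f" +/- {bootstrap_se(q_level, abs_err):.2f}")
```

Reporting each statistic together with its bootstrap standard error, as in the printed output above, is one way to convey the dataset-size dependence of benchmarking conclusions.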