The stochastic nature of iterative optimization heuristics leads to inherently noisy performance measurements. Since these measurements are often gathered once and then used repeatedly, the number of collected samples will have a significant impact on the reliability of algorithm comparisons. We show that care should be taken when making decisions based on limited data. In particular, we show that the number of runs used in many benchmarking studies, e.g., the default value of 15 suggested by the COCO environment, can be insufficient to reliably rank algorithms on well-known numerical optimization benchmarks. Additionally, methods for automated algorithm configuration are sensitive to insufficient sample sizes. This may result in the configurator choosing a `lucky' but poor-performing configuration despite having explored better ones. We show that relying on mean performance values, as many configurators do, can require a large number of runs to provide accurate comparisons between the considered configurations. Common statistical tests can greatly improve the situation in most cases, but not always. We show examples of performance losses of more than 20%, even when using statistical races to dynamically adjust the number of runs, as done by irace. Our results underline the importance of appropriately considering the statistical distribution of performance values.
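To make the sample-size effect concrete, the following minimal Python sketch, which is not part of the paper, simulates how often the empirical mean over a given number of runs ranks the worse of two synthetic algorithms ahead of the better one. The log-normal noise model and its parameters are assumptions chosen purely for illustration; they only serve to mimic overlapping, heavy-tailed performance distributions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def sample_runs(n, mu, sigma):
    # Log-normal noise as an assumed model for per-run performance values
    # (lower is better); parameters are purely illustrative.
    return rng.lognormal(mean=mu, sigma=sigma, size=n)

def misranking_rate(n_runs, trials=10_000):
    """Fraction of trials in which the sample mean prefers the worse algorithm."""
    wrong = 0
    for _ in range(trials):
        a = sample_runs(n_runs, mu=0.00, sigma=1.0)   # algorithm A: better in expectation
        b = sample_runs(n_runs, mu=0.15, sigma=1.0)   # algorithm B: slightly worse
        if a.mean() > b.mean():                       # mean-based ranking picks B
            wrong += 1
    return wrong / trials

for n in (5, 15, 50, 200):
    print(f"{n:3d} runs per algorithm: misranking rate ~ {misranking_rate(n):.2f}")
```

With these heavy-tailed, overlapping distributions, the misranking rate remains substantial at 15 runs and decays only slowly as the budget grows, which mirrors the abstract's point that mean-based comparisons can require a large number of runs to become reliable.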