Surrogate algorithms such as Bayesian optimisation are designed specifically for black-box optimisation problems with expensive objectives, such as hyperparameter tuning or simulation-based optimisation. In the literature, these algorithms are usually evaluated on synthetic benchmarks, which are well established but have no expensive objective, and on only one or two real-life applications, which vary widely between papers. There is a clear lack of standardisation when it comes to benchmarking surrogate algorithms on real-life, expensive, black-box objective functions. This makes it very difficult to draw conclusions about the effect of algorithmic contributions and to give substantial advice on which method to use when. A new benchmark library, EXPObench, provides first steps towards such a standardisation. We use the library to provide an extensive comparison of six different surrogate algorithms on four expensive optimisation problems from different real-life applications. This has led to new insights regarding the relative importance of exploration, the evaluation time of the objective, and the model used. We also provide rules of thumb for which surrogate algorithm to use in which situation. A further contribution is that we make the algorithms and benchmark problem instances publicly available, contributing to a more uniform analysis of surrogate algorithms. Most importantly, we include the performance of the six algorithms on all evaluated problem instances. This results in a unique new dataset that lowers the bar for researching new methods, as the number of expensive evaluations required for comparison is significantly reduced.