Sample-Efficient Optimisation with Probabilistic Transformer Surrogates

Faced with problems of increasing complexity, recent research in Bayesian Optimisation (BO) has focused on adapting deep probabilistic models as flexible alternatives to Gaussian Processes (GPs). In a similar vein, this paper investigates the feasibility of employing state-of-the-art probabilistic transformers in BO. On closer inspection, we observe two drawbacks stemming from their training procedure and loss definition, which hinder their direct deployment as proxies in black-box optimisation. First, we notice that these models are trained on uniformly distributed inputs, which impairs predictive accuracy on non-uniform data, a setting that arises in any typical BO loop due to exploration-exploitation trade-offs. Second, we realise that training losses (e.g., cross-entropy) only asymptotically guarantee accurate posterior approximations, i.e., after arriving at the global optimum, which generally cannot be ensured. At the stationary points of the loss function, however, we observe a degradation in predictive performance, especially in exploratory regions of the input space. To tackle these shortcomings, we introduce two components: 1) a BO-tailored training prior supporting non-uniformly distributed points, and 2) a novel approximate posterior regulariser that trades off accuracy and input sensitivity to filter favourable stationary points for improved predictive performance. In a large panel of experiments, we demonstrate, for the first time, that one transformer pre-trained on data sampled from random GP priors produces competitive results on 16 benchmark black-boxes compared to GP-based BO. Since our model is pre-trained only once and used in all tasks without any retraining or fine-tuning, we report an order-of-magnitude reduction in time, while matching and sometimes outperforming GPs.
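To make the idea of pre-training data "sampled from random GP priors" on non-uniformly distributed points more concrete, here is a minimal sketch of how one such synthetic task could be generated. The abstract does not specify the prior, so everything in it is an illustrative assumption rather than the authors' method: the one-dimensional input space, the mixture of uniform and clustered inputs, the squared-exponential kernel, the lengthscale range, and the helper names `sample_inputs`, `rbf_kernel`, `cholesky` and `sample_task`.

```python
import math
import random


def sample_inputs(n, rng, cluster_frac=0.5, cluster_std=0.05):
    """Assumed input prior: mix uniform points with points clustered near a random centre."""
    centre = rng.random()
    xs = []
    for _ in range(n):
        if rng.random() < cluster_frac:
            # Clustered point, clipped to the unit interval.
            x = min(1.0, max(0.0, rng.gauss(centre, cluster_std)))
        else:
            # Uniformly distributed point.
            x = rng.random()
        xs.append(x)
    return xs


def rbf_kernel(xs, lengthscale, jitter=1e-5):
    """Squared-exponential Gram matrix with diagonal jitter for numerical stability."""
    n = len(xs)
    return [
        [
            math.exp(-0.5 * ((xs[i] - xs[j]) / lengthscale) ** 2)
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def sample_task(n, rng):
    """Draw one synthetic task: non-uniform inputs and a GP prior sample at those inputs."""
    xs = sample_inputs(n, rng)
    lengthscale = rng.uniform(0.1, 1.0)  # assumed range for the random GP hyperparameter
    low = cholesky(rbf_kernel(xs, lengthscale))
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # y = L z is distributed as N(0, K) when K = L L^T.
    ys = [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    return xs, ys
```

Calling `sample_task(20, random.Random(0))` returns 20 inputs in [0, 1], some of them concentrated around one region, together with function values drawn from a zero-mean GP at those inputs. A pre-training set in this spirit would repeat the call many times with fresh random draws.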
