We present a new class of acquisition functions for online decision making in multi-armed and contextual bandit problems with extreme payoffs. Specifically, we model the payoff function as a Gaussian process and formulate a novel type of upper confidence bound (UCB) acquisition function that guides exploration towards the bandits that are deemed most relevant according to the variability of the observed rewards. This is achieved by computing a tractable likelihood ratio that quantifies the importance of the output relative to the inputs and essentially acts as an \textit{attention mechanism} that promotes exploration of extreme rewards. We demonstrate the benefits of the proposed methodology across several synthetic benchmarks, as well as a realistic example involving noisy sensor network data. Finally, we provide a JAX library for efficient bandit optimization using Gaussian processes.
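As a rough illustration of the idea, the sketch below scores a discrete set of arms with a UCB rule whose exploration term is reweighted by a likelihood ratio that upweights arms whose predicted rewards fall in low-density (extreme) regions of the output distribution. The specific weight (a uniform input density divided by a kernel-density estimate of the outputs), the bandwidth, and the function name `lw_ucb_scores` are illustrative assumptions, not the paper's exact formulation.

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def lw_ucb_scores(mu, sigma, kappa=2.0, bandwidth=0.5):
    """Likelihood-weighted UCB scores over a discrete set of arms.

    mu, sigma: GP posterior mean and std at each arm, shape [n_arms].
    The weight w approximates p(input) / p(output): arms whose predicted
    reward sits in a low-density region of the output distribution get
    boosted. Illustrative assumption, not the paper's exact formula.
    """
    # Kernel-density estimate of the output distribution, evaluated at mu.
    p_y = jnp.mean(norm.pdf((mu[:, None] - mu[None, :]) / bandwidth), axis=1) / bandwidth
    # Uniform input density over the arms; the ratio acts as an attention weight.
    w = (1.0 / mu.shape[0]) / (p_y + 1e-12)
    return mu + kappa * sigma * w

# Toy posterior: one arm with an extreme predicted reward.
mu = jnp.array([0.10, 0.20, 1.50, 0.15])   # posterior means
sigma = jnp.array([0.3, 0.3, 0.3, 0.3])    # posterior stds
next_arm = int(jnp.argmax(lw_ucb_scores(mu, sigma)))
```

With equal posterior uncertainty across arms, the extreme arm (index 2) receives both the largest mean and the largest attention weight, so it is selected next.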