We propose a novel modification of the standard upper confidence bound (UCB) method for the stochastic multi-armed bandit (MAB) problem which tunes the confidence bound of a given bandit based on its distance to others. Our UCB distance tuning (UCB-DT) formulation enables improved performance as measured by expected regret by preventing the MAB algorithm from focusing on non-optimal bandits which is a well-known deficiency of standard UCB. "Distance tuning" of the standard UCB is done using a proposed distance measure, which we call bandit distance, that is parameterizable and which therefore can be optimized to control the transition rate from exploration to exploitation based on problem requirements. We empirically demonstrate increased performance of UCB-DT versus many existing state-of-the-art methods which use the UCB formulation for the MAB problem. Our contribution also includes the development of a conceptual tool called the "Exploration Bargain Point" which gives insights into the tradeoffs between exploration and exploitation. We argue that the Exploration Bargain Point provides an intuitive perspective that is useful for comparatively analyzing the performance of UCB-based methods.
翻译:我们建议对标准多武装土匪(MAB)标准上限信任度(UCB)方法进行新颖的修改,该方法根据与他人的距离调控特定土匪的信任度。我们的UCB远程调频(UCB-DT)的配方能够根据预期的遗憾度度提高性能,防止MAB算法侧重于非最佳土匪,这是标准UCB的一个众所周知的缺陷。标准UCB的“差异调控”是使用一种拟议的距离测量方法完成的,我们称之为土匪距离,这是可测量的,因此可以优化,以控制从勘探到开发的过渡率,基于问题的要求。我们从经验上表明,UCB-DT的性能高于使用UCB的配方的许多现有最先进的方法。我们的贡献还包括开发一个概念工具,即“勘探讨价还点”,以了解勘探与开采之间的利关系。我们认为,“探索讨价价”提供了一种直观的观点,有助于比较分析UCBB基方法的性能。