Lipschitz bandits is a prominent version of multi-armed bandits that studies large, structured action spaces such as the [0,1] interval, where similar actions are guaranteed to have similar rewards. A central theme here is the adaptive discretization of the action space, which gradually "zooms in" on the more promising regions thereof. The goal is to take advantage of "nicer" problem instances, while retaining near-optimal worst-case performance. While the stochastic version of the problem is well-understood, the general version with adversarial rewards is not. We provide the first algorithm for adaptive discretization in the adversarial version, and derive instance-dependent regret bounds. In particular, we recover the worst-case optimal regret bound for the adversarial version, and the instance-dependent regret bound for the stochastic version. Further, an application of our algorithm to dynamic pricing (where a seller repeatedly adjusts prices for a product) enjoys these regret bounds without any smoothness assumptions.
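The "zooming in" idea behind adaptive discretization can be illustrated with a toy sketch for the stochastic setting on [0,1] (this is an illustrative simplification, not the paper's adversarial algorithm; the cell-splitting rule, confidence radius, and the Lipschitz-slack term in the index are assumptions made for the sketch):

```python
import math
import random

def zooming_bandit(reward, horizon, seed=0):
    """Adaptive discretization ("zooming") on [0,1]: maintain a set of
    cells, pull the cell with the highest optimistic index, and split a
    cell in half once its statistical uncertainty falls below its width,
    so refinement happens only in promising regions."""
    rng = random.Random(seed)
    # each cell: [left endpoint, width, number of pulls, total reward]
    cells = [[0.0, 1.0, 0, 0.0]]

    def index(c):
        left, w, n, s = c
        if n == 0:
            return float("inf")  # unexplored cells are pulled first
        conf = math.sqrt(2.0 * math.log(horizon) / n)
        # optimism: empirical mean + confidence radius + Lipschitz slack
        return s / n + conf + w

    for _ in range(horizon):
        c = max(cells, key=index)
        x = c[0] + rng.random() * c[1]  # sample an arm inside the cell
        c[2] += 1
        c[3] += reward(x, rng)
        left, w, n, _ = c
        if math.sqrt(2.0 * math.log(horizon) / n) < w:
            # zoom in: replace the cell with its two halves
            cells.remove(c)
            cells.append([left, w / 2, 0, 0.0])
            cells.append([left + w / 2, w / 2, 0, 0.0])

    # report the center of the most-pulled surviving cell
    best = max(cells, key=lambda c: c[2])
    return best[0] + best[1] / 2
```

On a 1-Lipschitz reward such as `1 - abs(x - 0.7)` plus bounded noise, the surviving fine cells concentrate around the optimum near 0.7, while unpromising regions are left at a coarse resolution; the adversarial version in the paper replaces the confidence-based index with an adversarial-bandit subroutine over the current cells.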