In this paper we study a multi-armed bandit problem in which the quality of each arm is measured by the Conditional Value at Risk (CVaR) at some level alpha of the reward distribution. While existing works in this setting mainly focus on Upper Confidence Bound algorithms, we introduce a new Thompson Sampling approach for CVaR bandits on bounded rewards that is flexible enough to solve a variety of problems grounded on physical resources. Building on a recent work by Riou & Honda (2020), we introduce B-CVTS for continuous bounded rewards and M-CVTS for multinomial distributions. On the theoretical side, we provide a non-trivial extension of their analysis that enables us to theoretically bound the CVaR regret minimization performance of these algorithms. Strikingly, our results show that these strategies are the first to provably achieve asymptotic optimality in CVaR bandits, matching the corresponding asymptotic lower bounds for this setting. Further, we empirically illustrate the benefit of Thompson Sampling approaches both in a realistic environment simulating a use case in agriculture and on various synthetic examples.
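To make the CVaR criterion and the Thompson Sampling index concrete, here is a minimal Python sketch, not the paper's implementation: `cvar_discrete` and `mcvts_round` are hypothetical helper names, the lower-tail CVaR convention and the uniform Dirichlet prior are our assumptions, and only the broad structure (sample a distribution from each arm's posterior, score it by its CVaR, play the best) follows the abstract's description of M-CVTS for multinomial arms.

```python
import numpy as np

def cvar_discrete(support, probs, alpha):
    """CVaR at level alpha of a distribution on a finite support
    (lower-tail convention: average reward over the worst
    alpha-fraction of the probability mass)."""
    order = np.argsort(support)
    s = np.asarray(support, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    mass_left, total = alpha, 0.0
    for value, prob in zip(s, p):
        take = min(prob, mass_left)   # mass taken from the lower tail
        total += take * value
        mass_left -= take
        if mass_left <= 1e-12:
            break
    return total / alpha

def mcvts_round(support, counts, alpha, rng):
    """One round of a Dirichlet-Thompson-Sampling CVaR policy (sketch):
    draw a categorical distribution from each arm's posterior
    (uniform Dirichlet prior assumed) and play the arm whose
    sampled distribution has the highest CVaR."""
    indices = [cvar_discrete(support, rng.dirichlet(c + 1.0), alpha)
               for c in counts]
    return int(np.argmax(indices))

# Tiny usage example: 2 arms on support {0, 0.5, 1} after a few pulls,
# where counts[k][i] is the number of times arm k returned support[i].
rng = np.random.default_rng(0)
counts = [np.array([5.0, 10.0, 5.0]), np.array([2.0, 3.0, 15.0])]
print(mcvts_round([0.0, 0.5, 1.0], counts, alpha=0.25, rng=rng))
```

Note the design choice this sketch illustrates: unlike a mean-reward bandit, the index is not the posterior mean but the CVaR of a full sampled distribution, which is why the posterior must live over distributions (here, a Dirichlet over the multinomial simplex) rather than over a scalar parameter.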