The stochastic multi-armed bandit (MAB) problem is a common model for sequential decision-making problems. In the standard setup, a decision maker has to choose at every instant between several competing arms, each of which yields a scalar random variable, referred to as a "reward." Nearly all research on this topic considers the total cumulative reward as the criterion of interest. This work focuses on other natural objectives that cannot be cast as a sum over rewards, but are rather more involved functions of the reward stream. Unlike the case of cumulative criteria, in the problems we study here the oracle policy, which knows the problem parameters a priori and is used to "center" the regret, is not trivial. We provide a systematic approach to such problems, and derive general conditions under which the oracle policy is sufficiently tractable to facilitate the design of optimism-based (upper confidence bound) learning policies. These conditions elucidate an interesting interplay between the arm reward distributions and the performance metric. Our main findings are illustrated for several commonly used objectives such as conditional value-at-risk, mean-variance trade-offs, the Sharpe ratio, and more.
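For concreteness, the objectives named above are typically defined as follows; these are the standard textbook forms for a reward random variable $R$ and risk level $\alpha \in (0,1)$, and the paper's exact formulations (e.g., applied to the empirical reward stream of a policy) may differ:
\begin{align*}
\mathrm{CVaR}_\alpha(R) &= \mathbb{E}\bigl[R \mid R \le \mathrm{VaR}_\alpha(R)\bigr],
&& \text{where } \mathrm{VaR}_\alpha(R) = \inf\{x : \Pr(R \le x) \ge \alpha\},\\
\mathrm{MV}_\rho(R) &= \mathbb{E}[R] - \rho\,\mathrm{Var}(R),
&& \text{mean-variance trade-off with risk coefficient } \rho > 0,\\
\mathrm{SR}(R) &= \frac{\mathbb{E}[R]}{\sqrt{\mathrm{Var}(R)}},
&& \text{Sharpe ratio (risk-free rate taken as zero).}
\end{align*}
None of these is additive in the per-round rewards, which is why the oracle policy that maximizes them is not simply "always pull the arm with the highest mean."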