对多标准、多武装、多武装匪徒进行限制,尽量减少对多标准多武装匪徒的遗憾 (Constrained regret minimization for multi-criterion multi-armed bandits)

We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a 'primary' attribute subject to user-provided constraints on other 'secondary' attributes. We assume that the attributes can be estimated using samples from the arms' distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a Boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.

翻译：我们考虑的是随机多武装匪徒设置,并研究在特定时间范围内限制遗憾最小化的问题。每只臂与未知的、可能多维分布有关,一个臂的优点由若干可能相互冲突的属性决定。目的是优化“初级”属性,但受用户提供的对其它“第二”属性的限制。我们假设这些属性可以使用武器分布的样本来估计,而且估计者享有适当的集中特性。我们提议了一个称为“Con-LCB”的算法,保证对数的遗憾,即所有非最佳武器的玩耍平均数量在地平线上最多为对数。算法还输出出一个“Boolean”标志,该标志非常有可能正确地确定给定的参数是否可行/不可行。我们还表明,Con-LCB在普遍常态中是最佳的,也就是说,更精密的算法无法做到更普遍的集中特性。最后,我们在最小化和可行性识别所有非最佳武器的玩耍之间确立了一种基本交易。我们的框架在财政组合中找到自然应用,例如,在最优化中,预期的回收是有意义的。