We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret that incorporates both the resources allocated in each learning round and the optimality of the realizable rewards. Thus, to minimize regret, the learner must choose an arm and set a resource limit such that a high reward is likely to be realized within that limit, while keeping the limit itself as low as possible. We propose a UCB-inspired online learning algorithm, which we analyze theoretically in terms of an upper bound on its regret. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.
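To make the interaction protocol concrete, the following is a minimal sketch of a single round of the censored-feedback setting described above. The reward and consumption distributions (Bernoulli rewards, exponential consumption) and the function name `play_round` are illustrative assumptions, not taken from the paper, and the snippet models only the environment, not the proposed UCB-inspired algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-arm parameters (illustrative only, not from the paper):
# each arm has a random reward and an independent random resource consumption.
reward_means = [0.4, 0.7, 0.9]        # mean (Bernoulli) reward per arm
consumption_means = [0.5, 1.0, 2.0]   # mean resource consumption per arm


def play_round(arm, resource_limit):
    """Simulate one round: pick an arm and a resource limit.

    If the arm's (random) resource consumption stays below the limit,
    the (random) reward is observed; otherwise the observation is
    censored and no reward is obtained.
    """
    consumption = rng.exponential(consumption_means[arm])
    reward = float(rng.random() < reward_means[arm])
    if consumption <= resource_limit:
        return reward, consumption, False       # reward realized
    return 0.0, resource_limit, True            # censored: limit spent, no reward


# Example: play arm 1 with a resource limit of 1.5 units.
r, used, censored = play_round(arm=1, resource_limit=1.5)
print(f"reward={r}, resources used={used:.2f}, censored={censored}")
```

In this sketch, the regret trade-off is visible directly: a larger `resource_limit` makes censoring less likely but increases the resources counted against the learner, while a smaller limit saves resources at the risk of losing the reward entirely.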