Recent work has considered natural variations of the multi-armed bandit problem, where the reward distribution of each arm is a special function of the time elapsed since it was last played. In this direction, a simple (yet widely applicable) model is that of blocking bandits, where an arm becomes unavailable for a deterministic number of rounds after each play. In this work, we extend the above model in two directions: (i) We consider the general combinatorial setting where more than one arm can be played at each round, subject to feasibility constraints. (ii) We allow the blocking time of each arm to be stochastic. We first study the computational/unconditional hardness of the above setting and identify the necessary conditions for the problem to become tractable (even in an approximate sense). Based on these conditions, we provide a tight analysis of the approximation guarantee of a natural greedy heuristic that always plays the feasible subset of available (non-blocked) arms with maximum expected reward. When the arms' expected rewards are unknown, we adapt the above heuristic into a UCB-based bandit algorithm, for which we provide sublinear (approximate) regret guarantees, matching the theoretical lower bounds in the limiting case where delays are absent.
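To make the greedy/UCB idea concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: Bernoulli rewards, geometrically distributed blocking times, and a simple cardinality constraint (play at most k arms per round) standing in for the general feasibility constraints. All function and variable names here are illustrative only.

```python
import heapq
import math
import random

def greedy_ucb_blocking(n_arms, k, horizon, true_means, block_prob, seed=0):
    """Hypothetical sketch: each round, greedily play the top-k available arms
    ranked by their UCB index; played arms are then blocked for a random
    (here: geometric) number of rounds."""
    rng = random.Random(seed)
    pulls = [0] * n_arms           # number of times each arm was played
    reward_sum = [0.0] * n_arms    # cumulative observed reward per arm
    blocked_until = [0] * n_arms   # first round at which the arm is available again
    total_reward = 0.0

    for t in range(1, horizon + 1):
        available = [a for a in range(n_arms) if blocked_until[a] <= t]

        # UCB index: empirical mean plus an exploration bonus;
        # unplayed arms get an infinite index so they are tried at least once.
        def ucb(a):
            if pulls[a] == 0:
                return float("inf")
            mean = reward_sum[a] / pulls[a]
            return mean + math.sqrt(2.0 * math.log(t) / pulls[a])

        # Greedy step: play the feasible subset of available arms with the
        # largest total index (under a cardinality constraint, the top-k arms).
        chosen = heapq.nlargest(k, available, key=ucb)

        for a in chosen:
            r = 1.0 if rng.random() < true_means[a] else 0.0
            pulls[a] += 1
            reward_sum[a] += r
            total_reward += r
            # Stochastic blocking: the arm stays unavailable for a random
            # number of future rounds (geometric, as an assumption here).
            delay = 1
            while rng.random() < block_prob[a]:
                delay += 1
            blocked_until[a] = t + delay

    return total_reward

if __name__ == "__main__":
    means = [0.9, 0.8, 0.6, 0.4, 0.2]
    blocks = [0.7, 0.5, 0.3, 0.2, 0.1]
    print(greedy_ucb_blocking(n_arms=5, k=2, horizon=10_000,
                              true_means=means, block_prob=blocks))
```

For richer feasibility constraints (e.g., matroid or matching constraints), the top-k selection above would be replaced by the corresponding greedy maximization over available arms; the bookkeeping of blocking times is unchanged.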