A recent line of research focuses on the stochastic multi-armed bandit problem (MAB) in the case where temporal correlations of a specific structure are imposed between the player's actions and the reward distributions of the arms (Kleinberg and Immorlica [FOCS18], Basu et al. [NeurIPS19]). As opposed to the standard MAB setting, where the optimal solution in hindsight can be trivially characterized, these correlations lead to (sub-)optimal solutions that exhibit interesting dynamical patterns -- a phenomenon that yields new challenges from both an algorithmic and a learning perspective. In this work, we extend the above direction to a combinatorial bandit setting and study a variant of stochastic MAB, where arms are subject to matroid constraints and each arm becomes unavailable (blocked) for a fixed number of rounds after each play. A natural common generalization of the state-of-the-art for blocking bandits and that for matroid bandits yields a $(1-\frac{1}{e})$-approximation for partition matroids, yet it only guarantees a $\frac{1}{2}$-approximation for general matroids. In this paper we develop new algorithmic ideas that allow us to obtain a polynomial-time $(1 - \frac{1}{e})$-approximation algorithm (asymptotically and in expectation) for any matroid, and thus to control the $(1-\frac{1}{e})$-approximate regret. A key ingredient is the technique of correlated (interleaved) scheduling. Along the way, we discover an interesting connection to a variant of Submodular Welfare Maximization, for which we provide (asymptotically) matching upper and lower approximability bounds.