Motivated by the fact that humans enjoy some level of unpredictability or novelty, and may therefore quickly get bored when interacting with a stationary policy, we introduce a novel non-stationary bandit problem where the expected reward of an arm is fully determined by the time elapsed since the arm last took part in a switch of actions. Our model generalizes previous notions of delay-dependent rewards and also relaxes most assumptions on the reward function, enabling the modeling of phenomena such as progressive satiation and periodic behaviours. Building upon the Combinatorial Semi-Bandits (CSB) framework, we design an algorithm and prove a bound on its regret with respect to the optimal non-stationary policy, which is NP-hard to compute. As in previous works, our regret analysis is based on defining and solving an appropriate trade-off between approximation and estimation. Preliminary experiments confirm the superiority of our algorithm over both the oracle greedy approach and a vanilla CSB solver.
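To make the setting concrete, the following is a minimal Python sketch of a toy environment in the spirit of the model above. Everything in it is an illustrative assumption rather than the paper's exact formalization: the class name LastSwitchDependentBandit, the specific reward functions, the Gaussian observation noise, and the convention that both arms involved in a switch have their delays reset are all hypothetical choices made only for this sketch.

import numpy as np

# Illustrative sketch only: a toy environment in the spirit of the
# last-switch-dependent model described in the abstract. The exact
# definition of an arm's state in the paper may differ.
class LastSwitchDependentBandit:
    def __init__(self, reward_fns, rng=None):
        # reward_fns[i](tau) -> expected reward of arm i when tau rounds
        # have elapsed since arm i last took part in a switch (assumption).
        self.reward_fns = reward_fns
        self.k = len(reward_fns)
        self.last_switch = np.zeros(self.k, dtype=int)  # round of each arm's last switch
        self.t = 0
        self.prev_arm = None
        self.rng = rng or np.random.default_rng(0)

    def pull(self, arm):
        self.t += 1
        if self.prev_arm is not None and arm != self.prev_arm:
            # Assumed convention: both the arm switched away from and the
            # arm switched to "take part in" the switch, so both delays reset.
            self.last_switch[self.prev_arm] = self.t
            self.last_switch[arm] = self.t
        tau = self.t - self.last_switch[arm]          # time since arm's last switch
        mean = self.reward_fns[arm](tau)              # delay-dependent expected reward
        self.prev_arm = arm
        return mean + self.rng.normal(scale=0.1)      # noisy observation

# Example: one arm exhibiting progressive satiation (reward decays the longer
# it is played without a switch), one exhibiting periodic behaviour.
env = LastSwitchDependentBandit([
    lambda tau: np.exp(-0.3 * tau),           # satiation
    lambda tau: 0.5 + 0.5 * np.sin(tau / 2),  # periodicity
])
rewards = [env.pull(t % 2) for t in range(10)]  # naive alternating policy

Even in such a toy setting, always playing the momentarily best arm can be suboptimal, since resting a satiating arm lets its reward recover; this is consistent with the abstract's claim that the proposed algorithm outperforms the oracle greedy approach.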