The conservative mechanism is a desirable property in decision-making problems, where it balances the tradeoff between exploration and exploitation. We propose the novel \emph{conservative contextual combinatorial cascading bandit ($C^4$-bandit)}, a cascading online learning game that incorporates the conservative mechanism. At each time step, the learning agent is given some contexts, must recommend a list of items that is no worse than the base strategy, and then observes the reward according to some stopping rule. We design the $C^4$-UCB algorithm to solve this problem and prove its $n$-step upper regret bound in two settings: known baseline reward and unknown baseline reward. The regret in both settings can be decomposed into two terms: (a) the upper bound for the general contextual combinatorial cascading bandit; and (b) a constant term for the regret incurred by the conservative mechanism. As a by-product, we also improve the regret bound of the conservative contextual combinatorial bandit. Experiments on synthetic data demonstrate the advantages of the algorithm and validate our theoretical analysis.
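The two-term decomposition mentioned above can be sketched schematically as follows; the symbols $R^{C^4\text{-UCB}}(n)$, $R_{C^3}(n)$, and $\Delta_{\mathrm{cons}}$ are illustrative shorthand for this abstract rather than the formal statement proved later in the paper:
\[
R^{C^4\text{-UCB}}(n) \;\le\; \underbrace{R_{C^3}(n)}_{\substack{\text{contextual combinatorial cascading term,}\\ \text{sublinear in } n}} \;+\; \underbrace{\Delta_{\mathrm{cons}}}_{\substack{\text{constant term from the}\\ \text{conservative constraint}}} .
\]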