A conservative mechanism is a desirable property in decision-making problems that balance the tradeoff between exploration and exploitation. We propose the novel \emph{conservative contextual combinatorial cascading bandit ($C^4$-bandit)}, a cascading online learning game that incorporates a conservative mechanism. At each time step, the learning agent receives contexts and must recommend a list of items that performs no worse than a base strategy; it then observes the reward according to some stopping rule. We design the $C^4$-UCB algorithm to solve this problem and prove its $n$-step upper regret bound in two settings: known baseline reward and unknown baseline reward. In both settings the regret decomposes into two terms: (a) the upper bound for the general contextual combinatorial cascading bandit; and (b) a constant term for the regret incurred by the conservative mechanism. The algorithm can be applied directly to search engines and recommender systems. Experiments on synthetic data demonstrate its advantages and validate our theoretical analysis.
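To make the conservative mechanism concrete, the following is a minimal sketch of the kind of per-step safety check such algorithms use: the agent plays the optimistic (UCB-selected) action only when a pessimistic estimate of the resulting cumulative reward still stays above a $(1-\alpha)$ fraction of the cumulative baseline reward, and otherwise falls back to the base strategy. The function name, arguments, and the exact form of the constraint here are illustrative assumptions, not the paper's precise formulation.

```python
def conservative_choice(lcb_cum_reward, lcb_candidate, baseline_reward, t, alpha):
    """Decide between the optimistic action and the baseline at step t.

    Illustrative conservative constraint (simplified, single-action form):
    play the optimistic action only if a lower-confidence estimate of the
    cumulative reward after playing it is at least (1 - alpha) times the
    cumulative reward the baseline would have earned over t + 1 steps.

    lcb_cum_reward  -- lower-confidence bound on reward accumulated so far
    lcb_candidate   -- lower-confidence bound on the candidate action's reward
    baseline_reward -- (known) per-step reward of the base strategy
    alpha           -- tolerated fractional loss relative to the baseline
    """
    if lcb_cum_reward + lcb_candidate >= (1 - alpha) * (t + 1) * baseline_reward:
        return "optimistic"
    return "baseline"
```

For example, with a per-step baseline reward of 0.4 and $\alpha = 0.1$, a candidate whose pessimistic reward estimate is 0.5 passes the check at $t = 0$, while one estimated at 0.1 does not, so the agent would fall back to the base strategy.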