Cascading bandits is a natural and popular model that frames the task of learning to rank from Bernoulli click feedback in a bandit setting. For the case of unstructured rewards, we prove matching upper and lower bounds for the problem-independent (i.e., gap-free) regret, both of which strictly improve the best known bounds. A key observation is that the hard instances of this problem are those with small mean rewards, i.e., the small click-through rates that are most relevant in practice. Based on this, and on the fact that a small mean implies a small variance for Bernoulli random variables, our key technical result shows that variance-aware confidence sets derived from the Bernstein and Chernoff bounds lead to optimal algorithms (up to log terms), whereas Hoeffding-based algorithms suffer order-wise suboptimal regret. This sharply contrasts with the standard (non-cascading) bandit setting, where variance-aware algorithms improve only constants. In light of this, and as an additional contribution, we propose a variance-aware algorithm for the structured case of linear rewards and show that its regret bound strictly improves on the state of the art.
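The intuition that small means yield tighter variance-aware confidence intervals can be illustrated numerically. The sketch below (an illustrative example, not the paper's algorithm; the constants follow one common form of the empirical Bernstein bound) compares the Hoeffding width, which ignores the empirical mean, with an empirical Bernstein width, which shrinks with the variance p(1 - p):

```python
import math

def hoeffding_width(n, delta):
    # Hoeffding: width depends only on n and delta, not on the empirical mean.
    return math.sqrt(math.log(2 / delta) / (2 * n))

def bernstein_width(p_hat, n, delta):
    # Empirical Bernstein: leading term scales with the empirical
    # standard deviation sqrt(p_hat * (1 - p_hat)), so small click-through
    # rates give much narrower intervals.
    var = p_hat * (1 - p_hat)
    return (math.sqrt(2 * var * math.log(3 / delta) / n)
            + 3 * math.log(3 / delta) / n)

n, delta = 10_000, 0.05
for p_hat in (0.5, 0.05, 0.005):
    print(f"p_hat={p_hat}: Hoeffding={hoeffding_width(n, delta):.4f}, "
          f"Bernstein={bernstein_width(p_hat, n, delta):.4f}")
```

For p_hat near 0.5 the two widths are comparable (Bernstein is slightly wider due to log constants), but for the small click-through rates emphasized above, the Bernstein width is an order of magnitude smaller, which is what the variance-aware algorithms exploit.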