Motivated by problems of learning to rank long item sequences, we introduce a variant of the cascading bandit model that considers flexible-length sequences with varying rewards and losses. We formulate two generative models for this problem within the generalized linear setting, and design and analyze upper confidence bound algorithms for it. Our analysis delivers tight regret bounds which, when specialized to vanilla cascading bandits, yield sharper guarantees than previously available in the literature. We evaluate our algorithms on a number of real-world datasets and show significantly improved empirical performance as compared to known cascading bandit baselines.