We study a ranking problem in the contextual multi-armed bandit setting. At each time step, a learning agent selects an ordered list of items and observes stochastic outcomes for each position. In online recommendation systems, simply displaying the most attractive items in order is not the best choice, since both position and item dependencies make the reward function complicated. A simple example is the lack of diversity that arises when all of the most attractive items come from the same category. We model position and item dependencies in the ordered list and design UCB- and Thompson-Sampling-type algorithms for this problem. We prove that the regret over $T$ rounds and $L$ positions is $\tilde{O}(L\sqrt{dT})$, which matches the order of previous works with respect to $T$ and grows only linearly in $L$. Our work generalizes existing studies in several directions, including position dependencies for which position discount is a special case, and proposes a more general contextual bandit model.
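To make the UCB-style selection of an ordered list concrete, the following is a minimal sketch of how a LinUCB-style greedy fill of $L$ positions might look. It assumes a shared linear reward model with per-(item, position) context vectors; the function name `select_ranked_list` and the exploration parameter `alpha` are hypothetical illustrations, not taken from the paper.

```python
import numpy as np

def select_ranked_list(contexts, A_inv, b, L, alpha=1.0):
    """Greedily fill L positions with the items of highest UCB score.

    contexts: dict mapping item -> list of per-position feature vectors (d,)
    A_inv: inverse of the d x d ridge-regression design matrix
    b: accumulated reward-weighted feature sum (d,)
    alpha: exploration parameter (hypothetical, for illustration)
    """
    theta_hat = A_inv @ b                      # ridge estimate of theta
    chosen, remaining = [], set(contexts)
    for pos in range(L):
        best_item, best_ucb = None, -np.inf
        for item in remaining:
            x = contexts[item][pos]
            # optimistic score: linear estimate plus exploration bonus
            ucb = x @ theta_hat + alpha * np.sqrt(x @ A_inv @ x)
            if ucb > best_ucb:
                best_item, best_ucb = item, ucb
        chosen.append(best_item)
        remaining.discard(best_item)
    return chosen
```

The greedy per-position fill captures that the same item can score differently at different positions; the paper's actual algorithms additionally model item dependencies within the list.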