In this paper, we introduce the Preselection Bandit problem, in which the learner preselects a subset of arms (choice alternatives) for a user, who then chooses the final arm from this subset. The learner is not aware of the user's preferences, but can learn them from observed choices. In our concrete setting, we allow these choices to be stochastic and model the user's actions by means of the Plackett-Luce model. The learner's main task is to preselect subsets that eventually lead to highly preferred choices. To formalize this goal, we introduce a reasonable notion of regret and derive lower bounds on the expected regret. Moreover, we propose algorithms for which the upper bound on the expected regret matches the lower bound up to a logarithmic factor in the time horizon.
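For context, the Plackett-Luce model referenced above governs which arm the user picks from the preselected subset. A minimal sketch of the standard choice probability is given below; the symbols $v_i$ (latent utility parameters) and $S$ (the preselected subset) are our notation for illustration and need not match the paper's.

% Standard Plackett-Luce (top-choice) probability, shown only as an
% illustrative sketch; v_i and S are assumed notation, not the paper's.
\[
  \mathbb{P}\bigl(\text{user chooses arm } i \mid S\bigr)
  \;=\; \frac{v_i}{\sum_{j \in S} v_j},
  \qquad v_j > 0 \ \text{for all } j \in S .
\]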