In this paper, we investigate the impact of diverse user preferences on learning under the stochastic multi-armed bandit (MAB) framework. We aim to show that when the user preferences are sufficiently diverse and each arm can be optimal for certain users, the O(log T) regret incurred by exploring the sub-optimal arms under the standard stochastic MAB setting can be reduced to a constant. Our intuition is that to achieve sub-linear regret, the number of times an optimal arm is pulled should scale linearly in time; when every arm is optimal for certain users and is pulled frequently, the estimated arm statistics quickly converge to their true values, dramatically reducing the need for exploration. We cast the problem into a stochastic linear bandit model, where both the user preferences and the arm states are modeled as independent and identically distributed (i.i.d.) d-dimensional random vectors. After receiving the user preference vector at the beginning of each time slot, the learner pulls an arm and receives a reward equal to the inner product of the preference vector and the arm state vector. We also assume that the state of the pulled arm is revealed to the learner once it is pulled. We propose a Weighted Upper Confidence Bound (W-UCB) algorithm and show that it achieves a constant regret when the user preferences are sufficiently diverse. The performance of W-UCB under general setups is also fully characterized and validated with synthetic data.
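To make the interaction protocol concrete, the following is a minimal simulation sketch of the setting described above, not the paper's exact W-UCB algorithm: at each round the learner observes an i.i.d. preference vector, computes a UCB-style index for each arm from its estimated mean state plus a confidence bonus, pulls the most optimistic arm, observes that arm's state, and receives the inner-product reward. The preference distribution, noise model, and confidence width are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of the diverse-preference linear bandit interaction.
# All distributions and the bonus term are assumptions, not the paper's spec.

rng = np.random.default_rng(0)
d, K, T = 5, 4, 10_000

true_means = rng.uniform(0.0, 1.0, size=(K, d))   # unknown mean arm-state vectors
mu_hat = np.zeros((K, d))                          # empirical mean state per arm
pulls = np.zeros(K)                                # pull counts

total_reward = 0.0
for t in range(1, T + 1):
    theta = rng.dirichlet(np.ones(d))              # i.i.d. user preference vector

    # UCB-style index: estimated reward plus an optimism bonus shrinking with pulls
    bonus = np.sqrt(2.0 * np.log(t) / np.maximum(pulls, 1))
    index = mu_hat @ theta + np.where(pulls == 0, np.inf, bonus)

    a = int(np.argmax(index))                      # pull the most optimistic arm
    state = true_means[a] + 0.05 * rng.standard_normal(d)  # noisy arm state, revealed
    total_reward += float(theta @ state)           # reward = inner product

    # update the running mean of the pulled arm's state vector
    pulls[a] += 1
    mu_hat[a] += (state - mu_hat[a]) / pulls[a]

print(f"average reward over {T} rounds: {total_reward / T:.3f}")
```

Because every arm is optimal for some realizations of the preference vector, each arm is pulled a constant fraction of the time, which is the mechanism by which the abstract argues the exploration cost can drop from O(log T) to a constant.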