This paper studies a new variant of the stochastic multi-armed bandits problem where auxiliary information about the arm rewards is available in the form of control variates. In many applications like queuing and wireless networks, the arm rewards are functions of some exogenous variables. The mean values of these variables are known a priori from historical data and can be used as control variates. Leveraging the theory of control variates, we obtain mean estimates with smaller variance and tighter confidence bounds. We develop an upper confidence bound based algorithm named UCB-CV and characterize the regret bounds in terms of the correlation between rewards and control variates when they follow a multivariate normal distribution. We also extend UCB-CV to other distributions using resampling methods like Jackknifing and Splitting. Experiments on synthetic problem instances validate performance guarantees of the proposed algorithms.
翻译:本文研究一种新型的随机多武装匪徒问题, 即以控制变异的形式提供关于手臂奖励的辅助信息。 在诸如排队和无线网络等许多应用中, 手臂奖励是某些外源变量的函数。 这些变量的平均值从历史数据中先验地得知, 并可以用作控制变异。 利用控制变异理论, 我们获得平均估计值, 差异较小, 信任界限更紧。 我们开发了一个以 UCB- CV 为主的上层信任约束算法, 并在奖励变异与控制变异之间的相关性方面描述遗憾界限, 当它们遵循多种变异的正常分布时。 我们还将UCB- CV推广到其他分配中, 使用Jackkfining 和 分解等重标方法。 合成问题实例实验验证了拟议算法的性能保障。