We consider the classic online learning and stochastic multi-armed bandit (MAB) problems in the setting where, at each step, the online policy can probe and find out which of a small number ($k$) of choices has the best reward (or loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with $k=2$ suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound for stochastic MAB with independent arms from $O(\sqrt{nT})$ to $O(n^2 \log T)$, where $n$ is the number of arms and $T$ is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, $k=3$ probes suffice to achieve parameter-independent constant regret, $O(n^2)$. Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited ``advice'' via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated.
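For concreteness, the following is a minimal sketch of the probe-then-play protocol and the regret notion referred to above, in illustrative notation: the symbols $r_t$, $S_t$, $I_t$ are ours, and the restriction $I_t \in S_t$ is an assumption of the sketch rather than a statement of the exact model. At each step $t$, the learner selects a probe set of $k$ arms, is told which probed arm is best, and then plays:
\[
  S_t \subseteq [n],\ |S_t| = k, \qquad
  \text{the probe reveals } \arg\max_{i \in S_t} r_t(i), \qquad
  \text{the learner then plays } I_t \in S_t .
\]
Regret is measured against the best fixed arm in hindsight,
\[
  \mathrm{Regret}(T) \;=\; \max_{i \in [n]} \, \mathbb{E}\!\left[\sum_{t=1}^{T} \bigl(r_t(i) - r_t(I_t)\bigr)\right],
\]
and the bounds quoted above concern how this quantity scales with $T$, $n$, and $k$.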