强盗强盗学习不完美背景 (Robust Bandit Learning with Imperfect Context)

A standard assumption in contextual multi-arm bandit is that the true context is perfectly known before arm selection. Nonetheless, in many practical applications (e.g., cloud resource management), prior to arm selection, the context information can only be acquired by prediction subject to errors or adversarial modification. In this paper, we study a contextual bandit setting in which only imperfect context is available for arm selection while the true context is revealed at the end of each round. We propose two robust arm selection algorithms: MaxMinUCB (Maximize Minimum UCB) which maximizes the worst-case reward, and MinWD (Minimize Worst-case Degradation) which minimizes the worst-case regret. Importantly, we analyze the robustness of MaxMinUCB and MinWD by deriving both regret and reward bounds compared to an oracle that knows the true context. Our results show that as time goes on, MaxMinUCB and MinWD both perform as asymptotically well as their optimal counterparts that know the reward function. Finally, we apply MaxMinUCB and MinWD to online edge datacenter selection, and run synthetic simulations to validate our theoretical analysis.

翻译：上下文多武器土匪的一个标准假设是,在选择手臂之前,真实的背景是完全知道的。然而,在许多实际应用中(如云层资源管理),在选择手臂之前,背景信息只能通过预测错误或对抗性修改才能获得。在本文中,我们研究了一个背景土匪环境,在每次回合结束时披露真实背景时,只能提供不完美的环境来选择手臂。我们建议两种强大的手臂选择算法:最大程度地利用最坏情况的奖励的Max MinUCB(最大程度的最小值UCB)和最大限度地减少最坏情况的退化的MinWD(最小度最坏情况退化),以尽量减少最坏情况的遗憾。重要的是,我们分析Max MinUCB和MINWD的稳健性,与了解真实背景的神器相比,我们得出遗憾和奖赏的界限。我们的结果显示,随着时间的流逝,Max MincucB和MinWD都以同样的方式和最优秀的对应方来履行奖赏功能。最后,我们应用Max Mincuard B和MinWD 来进行在线边缘数据中心的选择,并进行合成模拟来验证我们的理论分析。