We study meta-learning for adversarial multi-armed bandits. We consider the online-within-online setup, in which a player (learner) encounters a sequence of multi-armed bandit episodes. The player's performance is measured as regret against the best arm in each episode, according to the losses generated by an adversary. The difficulty of the problem depends on the empirical distribution of the per-episode best arms chosen by the adversary. We present an algorithm that can leverage the non-uniformity of this empirical distribution, and derive problem-dependent regret bounds. The solution comprises an inner learner that plays each episode separately, and an outer learner that updates the hyper-parameters of the inner algorithm between episodes. In the case where the best arm distribution is far from uniform, our algorithm improves upon the best bound that can be achieved by any online algorithm executed on each episode individually without meta-learning.
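As a minimal sketch of the online-within-online loop described above, the following Python snippet pairs an Exp3-style inner bandit learner with an exponential-weights-style outer learner that adapts the inner learner's initialization (prior over arms) between episodes. All names and update rules here (Exp3Inner, meta_learn, the choice of hyper-parameter being tuned) are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

class Exp3Inner:
    """Adversarial bandit learner for a single episode (simplified Exp3)."""
    def __init__(self, n_arms, prior, lr):
        self.n_arms = n_arms
        self.weights = np.array(prior, dtype=float)  # initialization supplied by the outer learner
        self.lr = lr

    def probs(self):
        return self.weights / self.weights.sum()

    def act(self, rng):
        p = self.probs()
        arm = rng.choice(self.n_arms, p=p)
        return arm, p[arm]

    def update(self, arm, loss, prob):
        # importance-weighted loss estimate, standard exponential-weights update
        est = loss / prob
        self.weights[arm] *= np.exp(-self.lr * est)


def meta_learn(loss_episodes, n_arms, outer_lr=0.1, inner_lr=0.1, seed=0):
    """Play a sequence of bandit episodes; between episodes, the outer learner
    shifts the prior toward arms that were best in hindsight so far."""
    rng = np.random.default_rng(seed)
    prior = np.ones(n_arms)  # outer learner's state: initialization for each inner run
    total_loss = 0.0
    for losses in loss_episodes:              # losses: (T, n_arms) array for one episode
        inner = Exp3Inner(n_arms, prior, inner_lr)
        cum_loss = np.zeros(n_arms)
        for t in range(losses.shape[0]):
            arm, p = inner.act(rng)
            total_loss += losses[t, arm]
            inner.update(arm, losses[t, arm], p)
            cum_loss += losses[t]
        best_arm = int(np.argmin(cum_loss))   # per-episode best arm in hindsight
        prior[best_arm] *= np.exp(outer_lr)   # outer update: boost mass on that arm
    return total_loss
```

If the adversary tends to reuse a small set of best arms across episodes, the prior concentrates on them and later inner runs start closer to the right arm; with a uniform best-arm distribution the prior stays roughly flat and the loop reduces to running the inner learner from scratch each episode.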