We study a sequential decision problem where the learner faces a sequence of $K$-armed stochastic bandit tasks. An adversary may design the tasks, but the adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms. The task boundaries may be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). We design an algorithm based on a reduction to bandit submodular maximization and show that, in the regime of a large number of tasks and a small number of optimal arms, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we give a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.
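For intuition, a rough comparison (assuming the total number of rounds is $T = N\tau$, consistent with $N$ tasks of fixed length $\tau$): the baseline scales as $\tilde{O}(\sqrt{KNT}) = \tilde{O}(N\sqrt{K\tau})$, so the first bound above is smaller roughly when
\[
NM\sqrt{M\tau} + N^{2/3}M\tau \ll N\sqrt{K\tau},
\quad \text{i.e., when } M \ll K^{1/3} \text{ and } M\sqrt{\tau} \ll N^{1/3}\sqrt{K},
\]
which is precisely the regime of many tasks $N$ and a small set of optimal arms $M$ relative to $K$.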