We study a sequential decision problem where the learner faces a sequence of $K$-armed stochastic bandit tasks. The tasks may be designed by an adversary, but the adversary is constrained to choose the optimal arm of each task from a smaller (but unknown) subset of $M$ arms. The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). We design an algorithm based on a reduction to bandit submodular maximization and show that, in the regime of a large number of tasks and a small number of optimal arms, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$, where $N$ is the number of tasks and $T$ is the total number of rounds, that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M\tau} + N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we give a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M\tau} + N^{1/2}\sqrt{MK\tau})$ regret.
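As a rough back-of-the-envelope comparison (a sketch of our own, assuming the total horizon is $T = N\tau$ for $N$ tasks of fixed length $\tau$), the baseline and the first meta-learning bound relate as
\[
\sqrt{KNT} \;=\; \sqrt{KN \cdot N\tau} \;=\; N\sqrt{K\tau},
\qquad
NM\sqrt{M\tau} + N^{2/3}M\tau \;\ll\; N\sqrt{K\tau}
\quad\text{when } M^{3}\ll K \text{ and } N \gg \frac{M^{3}\tau^{3/2}}{K^{3/2}},
\]
which makes concrete the regime of many tasks and few optimal arms in which the improvement over the non-stationary baseline holds.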