We study a sequential decision problem in which the learner faces a sequence of $K$-armed bandit tasks. The task boundaries may be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). For a given integer $M \le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization and show that, for $T$ rounds comprising $N$ tasks, in the regime of a large number of tasks and a small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ attainable by standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M\tau} + N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we present a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M\tau} + N^{1/2}\sqrt{MK\tau})$ regret.
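For intuition on the stated regime, the following back-of-the-envelope comparison (our illustration, assuming the fixed-task-length setting so that $T = N\tau$; not part of the formal statements) rewrites the baseline as
\[
\tilde{O}\bigl(\sqrt{KNT}\bigr) = \tilde{O}\bigl(\sqrt{KN \cdot N\tau}\bigr) = \tilde{O}\bigl(N\sqrt{K\tau}\bigr),
\]
so the bound $\tilde{O}(NM\sqrt{M\tau} + N^{2/3}M\tau)$ improves on it whenever
\[
NM\sqrt{M\tau} \ll N\sqrt{K\tau} \;\Longleftrightarrow\; M^{3} \ll K
\qquad \text{and} \qquad
N^{2/3}M\tau \ll N\sqrt{K\tau} \;\Longleftrightarrow\; N \gg M^{3}\,(\tau/K)^{3/2},
\]
i.e., precisely when the number of tasks $N$ is large and the number of optimal arms $M$ is small.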