Policy regret is a well-established notion for measuring the performance of an online learning algorithm against an adaptive adversary. We study restrictions on the adversary that enable efficient minimization of the \emph{complete policy regret}, which is the strongest possible version of policy regret. We identify a gap in the current theoretical understanding of what sorts of restrictions permit tractability in this challenging setting. To resolve this gap, we consider a generalization of the stochastic multi-armed bandit, which we call the \emph{tallying bandit}. This is an online learning setting with an $m$-memory-bounded adversary, where the average loss for playing an action is an unknown function of the number (or tally) of times that the action was played in the last $m$ timesteps. For tallying bandit problems with $K$ actions and time horizon $T$, we provide an algorithm that, with high probability, achieves a complete policy regret guarantee of $\tilde{\mathcal{O}}(mK\sqrt{T})$, where the $\tilde{\mathcal{O}}$ notation hides only logarithmic factors. We additionally prove an $\tilde{\Omega}(\sqrt{mKT})$ lower bound on the expected complete policy regret of any tallying bandit algorithm, demonstrating the near-optimality of our method.
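To make the tallying bandit setting concrete, the following is a minimal simulation sketch (not code from the paper; the class name `TallyingBandit`, the specific loss functions, and the Gaussian noise model are illustrative assumptions). It captures the defining property of the setting: the expected loss of an action depends only on its tally, i.e., how many times it was played in the last $m$ timesteps.

```python
import numpy as np
from collections import deque


class TallyingBandit:
    """Illustrative tallying-bandit environment (a sketch, not the paper's implementation).

    The expected loss of action a at time t is g_a(y), where y is the number of
    times a was played in the last m timesteps (its "tally").
    """

    def __init__(self, loss_fns, m, noise_std=0.1, seed=0):
        self.loss_fns = loss_fns        # K functions: tally in {0, ..., m} -> expected loss
        self.m = m                      # memory length of the m-memory-bounded adversary
        self.noise_std = noise_std      # assumed observation-noise scale (illustrative)
        self.history = deque(maxlen=m)  # the last m actions played
        self.rng = np.random.default_rng(seed)

    def play(self, action):
        # Tally = number of occurrences of `action` among the last m plays.
        tally = sum(1 for a in self.history if a == action)
        expected_loss = self.loss_fns[action](tally)
        self.history.append(action)
        # The learner only observes a noisy version of the unknown expected loss.
        return expected_loss + self.noise_std * self.rng.standard_normal()


# Example: K = 2 actions, m = 3. Action 0 "fatigues" as its tally grows,
# while action 1 has a constant expected loss.
env = TallyingBandit(loss_fns=[lambda y: 0.2 + 0.2 * y, lambda y: 0.5], m=3)
losses = [env.play(0) for _ in range(5)]  # repeated play of action 0 raises its tally, hence its loss
```

Under these assumptions, a memoryless strategy that repeats a single action can be far from optimal, which is why performance is measured by complete policy regret rather than standard regret.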