We study the fundamental limits of learning in contextual bandits, where a learner's rewards depend on their actions and a known context, a setting that extends the canonical multi-armed bandit to the case where side information is available. We are interested in universally consistent algorithms, which achieve sublinear regret compared to any measurable fixed policy, without any function class restriction. For stationary contextual bandits, where the underlying reward mechanism is time-invariant, [Blanchard et al.] characterized the context processes for which universal consistency is achievable, and further gave algorithms that ensure universal consistency whenever it is achievable, a property known as optimistic universal consistency. It is well understood, however, that reward mechanisms can evolve over time, possibly depending on the learner's actions. We show that optimistic universal learning for non-stationary contextual bandits is impossible in general, in contrast to all previously studied settings in online learning, including standard supervised learning. We also give necessary and sufficient conditions for universal learning under various non-stationarity models, including online and adversarial reward mechanisms. In particular, the set of learnable context processes for non-stationary rewards remains extremely general, larger than the classes of i.i.d., stationary, or ergodic processes, but is in general strictly smaller than that for supervised learning or stationary contextual bandits, shedding light on new non-stationary phenomena.
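To make the regret criterion concrete, the following is a minimal sketch of the sublinear-regret requirement behind universal consistency; the notation is assumed here rather than fixed by the abstract: $X_t$ denotes the context at time $t$, $\hat{a}_t$ the learner's selected action, $r_t$ the (possibly non-stationary) reward mechanism, and $\pi$ an arbitrary measurable policy mapping contexts to actions. Under these assumptions, an algorithm is universally consistent for a context process $(X_t)_{t \ge 1}$ if, for every measurable policy $\pi$,
\[
\limsup_{T \to \infty} \; \frac{1}{T} \sum_{t=1}^{T} \Big( r_t\big(\pi(X_t), X_t\big) - r_t\big(\hat{a}_t, X_t\big) \Big) \le 0 \quad \text{almost surely,}
\]
that is, the average excess reward of any fixed measurable policy over the learner vanishes asymptotically.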