A central problem in online learning and decision making -- from bandits to reinforcement learning -- is to understand what modeling assumptions lead to sample-efficient learning guarantees. We consider a general adversarial decision making framework that encompasses (structured) bandit problems with adversarial rewards and reinforcement learning problems with adversarial dynamics. Our main result is to show -- via new upper and lower bounds -- that the Decision-Estimation Coefficient, a complexity measure introduced by Foster et al. in the stochastic counterpart to our setting, is necessary and sufficient to obtain low regret for adversarial decision making. However, compared to the stochastic setting, one must apply the Decision-Estimation Coefficient to the convex hull of the class of models (or, hypotheses) under consideration. This establishes that the price of accommodating adversarial rewards or dynamics is governed by the behavior of the model class under convexification, and recovers a number of existing results -- both positive and negative. En route to obtaining these guarantees, we provide new structural results that connect the Decision-Estimation Coefficient to variants of other well-known complexity measures, including the Information Ratio of Russo and Van Roy and the Exploration-by-Optimization objective of Lattimore and Gy\"{o}rgy.
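For concreteness, a minimal sketch of the central quantity, written in notation assumed here (decision space $\Pi$, model class $\mathcal{M}$, mean reward $f^{M}$ under model $M$ with optimal decision $\pi_M$, and squared Hellinger distance $D_{\mathrm{H}}^{2}$ between observation distributions) rather than taken verbatim from the paper:

% Sketch of the Decision-Estimation Coefficient under the assumed notation above.
\[
\mathsf{dec}_{\gamma}(\mathcal{M}, \bar{M})
  \;=\; \inf_{p \in \Delta(\Pi)} \; \sup_{M \in \mathcal{M}} \;
        \mathbb{E}_{\pi \sim p}\!\left[
          f^{M}(\pi_M) - f^{M}(\pi)
          \;-\; \gamma \cdot D_{\mathrm{H}}^{2}\!\bigl(M(\pi), \bar{M}(\pi)\bigr)
        \right],
\qquad
\mathsf{dec}_{\gamma}(\mathcal{M}) \;=\; \sup_{\bar{M}} \, \mathsf{dec}_{\gamma}(\mathcal{M}, \bar{M}).
\]
% For the adversarial setting discussed above, the relevant quantity is the same
% coefficient applied to the convex hull of the model class, \mathsf{dec}_{\gamma}(\mathrm{co}(\mathcal{M})).

Under this reading, the abstract's claim is that regret for adversarial decision making is characterized by $\mathsf{dec}_{\gamma}(\mathrm{co}(\mathcal{M}))$, whereas the stochastic setting is governed by $\mathsf{dec}_{\gamma}(\mathcal{M})$ itself.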