A fundamental challenge in interactive learning and decision making, ranging from bandit problems to reinforcement learning, is to provide sample-efficient, adaptive learning algorithms that achieve near-optimal regret. This question is analogous to the classical problem of optimal (supervised) statistical learning, where there are well-known complexity measures (e.g., VC dimension and Rademacher complexity) that govern the statistical complexity of learning. However, characterizing the statistical complexity of interactive learning is substantially more challenging due to the adaptive nature of the problem. The main result of this work provides a complexity measure, the Decision-Estimation Coefficient, that is proven to be both necessary and sufficient for sample-efficient interactive learning. In particular, we provide: 1. a lower bound on the optimal regret for any interactive decision making problem, establishing the Decision-Estimation Coefficient as a fundamental limit. 2. a unified algorithm design principle, Estimation-to-Decisions (E2D), which transforms any algorithm for supervised estimation into an online algorithm for decision making. E2D attains a regret bound matching our lower bound, thereby achieving optimal sample-efficient learning as characterized by the Decision-Estimation Coefficient. Taken together, these results constitute a theory of learnability for interactive decision making. When applied to reinforcement learning settings, the Decision-Estimation Coefficient recovers essentially all existing hardness results and lower bounds. More broadly, the approach can be viewed as a decision-theoretic analogue of the classical Le Cam theory of statistical estimation; it also unifies a number of existing approaches -- both Bayesian and frequentist.