We consider the problem of interactive decision making, encompassing structured bandits and reinforcement learning with general function approximation. Recently, Foster et al. (2021) introduced the Decision-Estimation Coefficient, a measure of statistical complexity that lower bounds the optimal regret for interactive decision making, as well as a meta-algorithm, Estimation-to-Decisions, which achieves upper bounds in terms of the same quantity. Estimation-to-Decisions is a reduction that lifts algorithms for (supervised) online estimation into algorithms for decision making. In this note, we show that by combining Estimation-to-Decisions with a specialized form of optimistic estimation introduced by Zhang (2022), it is possible to obtain guarantees that improve upon those of Foster et al. (2021) by accommodating more lenient notions of estimation error. We use this approach to derive regret bounds for model-free reinforcement learning with value function approximation.
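For the reader's convenience, here is a minimal sketch of the two quantities the abstract refers to, following the conventions of Foster et al. (2021); the notation (decision space $\Pi$, model class $\mathcal{M}$, reference model $\bar{M}$, mean reward $f^{M}(\pi)$, optimal decision $\pi_{M}$, squared Hellinger distance $D^2_{\mathsf{H}}$) is assumed rather than fixed by this abstract, and constants are suppressed. The Decision-Estimation Coefficient is the minimax value

\[
\mathsf{dec}_{\gamma}(\mathcal{M}, \bar{M})
  \;=\; \inf_{p \in \Delta(\Pi)} \, \sup_{M \in \mathcal{M}} \,
  \mathbb{E}_{\pi \sim p}\!\left[
    f^{M}(\pi_{M}) - f^{M}(\pi)
    \;-\; \gamma \cdot D^2_{\mathsf{H}}\!\bigl(M(\pi), \bar{M}(\pi)\bigr)
  \right],
\]

and Estimation-to-Decisions, run with an online estimation oracle whose cumulative Hellinger estimation error is $\mathbf{Est}_{\mathsf{H}}(T)$, enjoys regret bounds of the schematic form

\[
\mathbf{Reg}_{\mathsf{DM}}
  \;\lesssim\; \sup_{\bar{M}} \, \mathsf{dec}_{\gamma}(\mathcal{M}, \bar{M}) \cdot T
  \;+\; \gamma \cdot \mathbf{Est}_{\mathsf{H}}(T).
\]

The contribution described above is to accommodate more lenient notions of estimation error in place of $\mathbf{Est}_{\mathsf{H}}(T)$, via the optimistic estimation scheme of Zhang (2022).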