Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL within the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC) and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, respectively, thereby extending the original DEC, which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021), which require either bounding a variant of the DEC that may be prohibitively large or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. We also generalize the DEC to give sample-efficient algorithms for all-policy model estimation, with applications to learning equilibria in Markov Games. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy regret bounds similar to those of E2D-TA under structural conditions similar to the DEC.
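For reference, a hedged recall of the Decision-Estimation Coefficient as defined by Foster et al. (2021); the notation below (model class $\mathcal{M}$, reference model $\bar{M}$, policy class $\Pi$, value function $f^M$, optimal policy $\pi_M$, and squared Hellinger distance $D_{\mathrm{H}}^2$) is standard in that line of work but not spelled out in this abstract, so it should be read as an assumed convention rather than this paper's exact statement:
\[
  \mathrm{dec}_{\gamma}(\mathcal{M}, \bar{M})
  \;=\;
  \inf_{p \in \Delta(\Pi)} \;
  \sup_{M \in \mathcal{M}} \;
  \mathbb{E}_{\pi \sim p}
  \Big[
      f^{M}(\pi_{M}) - f^{M}(\pi)
      \;-\;
      \gamma \, D_{\mathrm{H}}^{2}\big(M(\pi), \bar{M}(\pi)\big)
  \Big],
\]
i.e., the best achievable trade-off between the regret incurred under the worst-case model $M$ and the information gained about $M$ relative to the reference model $\bar{M}$, with $\gamma$ controlling the exchange rate between the two.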