Goal-conditioned reinforcement learning (GCRL) refers to learning general-purpose skills that can reach diverse goals. In particular, offline GCRL requires only pre-collected datasets for training, without additional interaction with the environment. Although offline GCRL has become increasingly prevalent and many previous works have demonstrated its empirical success, the theoretical understanding of efficient offline GCRL algorithms is not well established, especially when the state space is huge and the offline dataset only covers the policy we aim to learn. In this paper, we propose a novel provably efficient algorithm (the sample complexity is $\tilde{O}(\mathrm{poly}(1/\epsilon))$, where $\epsilon$ is the desired suboptimality of the learned policy) with general function approximation. Our algorithm requires only nearly minimal assumptions on the dataset (single-policy concentrability) and the function class (realizability). Moreover, our algorithm consists of two uninterleaved optimization steps, which we refer to as $V$-learning and policy learning, and is computationally stable since it does not involve minimax optimization. To the best of our knowledge, this is the first algorithm with general function approximation and single-policy concentrability that is both statistically efficient and computationally stable.
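The abstract does not spell out the concrete objectives used in the two phases, but the following is a minimal PyTorch sketch of the uninterleaved two-phase structure it describes: a $V$-learning phase that fits a goal-conditioned value function on the offline dataset, followed by a policy-learning phase that extracts a policy from the frozen value function, with no minimax optimization. The specific losses below (a TD-style regression for $V$ and advantage-weighted behavior cloning for the policy), as well as the class and function names, are illustrative assumptions, not the paper's actual algorithm.

```python
# Illustrative sketch only: two uninterleaved phases (V-learning, then policy
# learning), each a plain regression, so no minimax optimization is involved.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=64):
    # Small MLP used as a generic function approximator.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class OfflineGCRL:
    def __init__(self, s_dim, a_dim, g_dim, gamma=0.99):
        self.gamma = gamma
        self.V = mlp(s_dim + g_dim, 1)       # goal-conditioned value V(s, g)
        self.pi = mlp(s_dim + g_dim, a_dim)  # deterministic policy head pi(s, g)

    def v_learning(self, batch, steps=1000, lr=1e-3):
        """Phase 1: fit V(s, g) on the offline data (here: TD-style regression)."""
        s, a, r, s2, g = batch
        opt = torch.optim.Adam(self.V.parameters(), lr=lr)
        for _ in range(steps):
            with torch.no_grad():
                target = r + self.gamma * self.V(torch.cat([s2, g], -1)).squeeze(-1)
            pred = self.V(torch.cat([s, g], -1)).squeeze(-1)
            loss = ((pred - target) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    def policy_learning(self, batch, steps=1000, lr=1e-3, beta=1.0):
        """Phase 2: extract a policy from the frozen V via weighted behavior cloning."""
        s, a, r, s2, g = batch
        opt = torch.optim.Adam(self.pi.parameters(), lr=lr)
        with torch.no_grad():
            adv = (r + self.gamma * self.V(torch.cat([s2, g], -1)).squeeze(-1)
                   - self.V(torch.cat([s, g], -1)).squeeze(-1))
            w = torch.exp(beta * adv).clamp(max=100.0)  # advantage weights, clipped
        for _ in range(steps):
            pred_a = self.pi(torch.cat([s, g], -1))
            loss = (w * ((pred_a - a) ** 2).sum(-1)).mean()
            opt.zero_grad(); loss.backward(); opt.step()


# Hypothetical usage with a batch of offline transitions (s, a, r, s', g):
agent = OfflineGCRL(s_dim=4, a_dim=2, g_dim=2)
batch = (torch.randn(256, 4), torch.randn(256, 2), torch.rand(256),
         torch.randn(256, 4), torch.randn(256, 2))
agent.v_learning(batch)
agent.policy_learning(batch)
```

Note that the two phases do not interleave: the policy-learning phase only reads the value network fitted in the first phase, which mirrors the "uninterleaved" and minimax-free structure claimed in the abstract.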