Policy optimization methods remain a powerful workhorse in empirical Reinforcement Learning (RL), with a focus on neural policies that can easily reason over complex and continuous state and/or action spaces. Theoretical understanding of strategic exploration in policy-based methods with non-linear function approximation, however, is largely missing. In this paper, we address this question by designing ENIAC, an actor-critic method that allows non-linear function approximation in the critic. We show that under certain assumptions, e.g., a bounded eluder dimension $d$ for the critic class, the learner finds a near-optimal policy in $O(\mathrm{poly}(d))$ exploration rounds. The method is robust to model misspecification and strictly extends existing works on linear function approximation. We also develop some computational optimizations of our approach with slightly worse statistical guarantees and an empirical adaptation building on existing deep RL tools. We empirically evaluate this adaptation and show that it outperforms prior heuristics inspired by linear methods, establishing the value of correctly reasoning about the agent's uncertainty under non-linear function approximation.
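To make the high-level idea concrete, the sketch below (our own illustration, not the paper's ENIAC algorithm) shows one way an actor-critic update with a non-linear critic can reason about uncertainty: the critic's epistemic uncertainty is approximated by disagreement within a small ensemble and added as an optimism bonus when updating the actor. All names, shapes, and the specific bonus construction are assumptions for illustration only.

```python
# Illustrative actor-critic step with a non-linear critic and an
# ensemble-disagreement exploration bonus (assumed construction, not ENIAC).
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, n_critics = 4, 2, 5  # hypothetical problem sizes

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 64), nn.Tanh(), nn.Linear(64, out))

actor = mlp(obs_dim, act_dim)                                  # action logits
critics = [mlp(obs_dim + act_dim, 1) for _ in range(n_critics)]  # Q ensemble
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(
    [p for c in critics for p in c.parameters()], lr=1e-3)

def q_values(obs, act_onehot):
    """Mean Q estimate and ensemble disagreement (both shape [B, 1])."""
    qs = torch.stack([c(torch.cat([obs, act_onehot], dim=-1)) for c in critics])
    return qs.mean(0), qs.std(0)

def update(obs, act_onehot, ret):
    # Critic: regress the mean ensemble prediction toward return targets.
    q_mean, _ = q_values(obs, act_onehot)
    critic_loss = F.mse_loss(q_mean, ret)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: policy-gradient ascent on an optimistic value, i.e.
    # mean Q plus the disagreement bonus that encourages exploration.
    q_mean, q_std = q_values(obs, act_onehot)
    optimistic_q = (q_mean + q_std).detach().squeeze(-1)
    logp = F.log_softmax(actor(obs), dim=-1)
    logp_taken = logp.gather(1, act_onehot.argmax(-1, keepdim=True)).squeeze(-1)
    actor_loss = -(logp_taken * optimistic_q).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

A provable method additionally needs a principled uncertainty quantifier (e.g., one whose complexity is controlled by the eluder dimension of the critic class); the ensemble standard deviation here is only a common empirical stand-in.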