This paper introduces CASA (Critic AS an Actor), a novel design for model-free reinforcement learning. CASA follows the actor-critic framework and estimates the state value, the state-action value, and the policy simultaneously. We prove that CASA integrates a consistent path for policy evaluation and policy improvement, completely eliminating the gradient conflict between the two. The policy evaluation is equivalent to a compensational policy improvement, which alleviates the function approximation error, and to an entropy-regularized policy improvement, which prevents the policy from being trapped in a suboptimal solution. Building on this design, we introduce an expectation-correct Doubly Robust Trace to learn the state value and the state-action value, with guaranteed convergence. Our experiments show that the design achieves state-of-the-art performance on the Arcade Learning Environment.
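For context, the entropy-regularized policy improvement referred to above can be sketched in its standard generic form; the temperature $\alpha$ and the notation below are illustrative assumptions, not necessarily the paper's exact objective:
\begin{equation*}
  \pi_{\mathrm{new}}
    = \arg\max_{\pi}\;
      \mathbb{E}_{a \sim \pi(\cdot \mid s)}
        \bigl[ Q^{\pi_{\mathrm{old}}}(s, a) \bigr]
      + \alpha \,\mathcal{H}\bigl(\pi(\cdot \mid s)\bigr),
  \qquad
  \mathcal{H}\bigl(\pi(\cdot \mid s)\bigr)
    = -\sum_{a} \pi(a \mid s)\,\log \pi(a \mid s).
\end{equation*}
The entropy bonus $\mathcal{H}$ keeps the policy stochastic during improvement, which is the standard mechanism for avoiding premature convergence to a suboptimal deterministic policy.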