Building on recent breakthroughs in reinforcement learning, this paper introduces CASA-B (Critic AS an Actor with Bandits Vote Algorithm), a unified framework for model-free reinforcement learning. CASA-B is an actor-critic framework that estimates the state value, the state-action value, and the policy. An expectation-correct Doubly Robust Trace, whose convergence properties are guaranteed, is introduced to learn the state value and the state-action value. We prove that CASA-B integrates a consistent path for policy evaluation and policy improvement: the policy evaluation is equivalent to a compensational policy improvement, which alleviates function-approximation error, and to an entropy-regularized policy improvement, which prevents the policy from collapsing to a suboptimal solution. Building on this design, we find that the entropies of the behavior policies and of the target policy are disentangled. Based on this observation, we propose a progressive closed-form entropy-control mechanism that explicitly steers the behavior policies' entropy to an arbitrary range. Our experiments show that CASA-B is highly sample-efficient and achieves state-of-the-art performance on the Arcade Learning Environment, with a mean Human Normalized Score of 6456.63% and a median Human Normalized Score of 477.17% at the 200M training scale.
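For context, the standard per-decision doubly robust return recursion (Jiang and Li, 2016), which an expectation-correct Doubly Robust Trace plausibly builds on (the abstract does not give the exact estimator), is

$$
G^{\mathrm{DR}}_t = \hat{V}(s_t) + \rho_t \bigl( r_t + \gamma\, G^{\mathrm{DR}}_{t+1} - \hat{Q}(s_t, a_t) \bigr),
\qquad
\rho_t = \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)},
$$

with $G^{\mathrm{DR}}_T = \hat{V}(s_T)$ at the horizon, where $\pi$ is the target policy, $\mu$ the behavior policy, and $\hat{V}(s) = \mathbb{E}_{a \sim \pi}[\hat{Q}(s, a)]$. With exact importance ratios, the critic baselines cancel in expectation under $\mu$, so the estimate of $V^{\pi}(s_t)$ stays unbiased regardless of the error in $\hat{Q}$, which is the sense in which such a trace is expectation-correct.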
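As a rough illustration of what explicit entropy control of a behavior policy can look like — not the paper's closed-form rule, which the abstract does not specify — the sketch below steers a softmax behavior policy to a target entropy by searching over the temperature; all names here (e.g. policy_with_target_entropy) are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def softmax_policy(q, tau):
    """Softmax policy over action values q at temperature tau."""
    z = (q - q.max()) / tau          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def policy_with_target_entropy(q, h_target, tol=1e-4):
    """Return a softmax policy whose entropy is ~h_target, by bisection
    on the temperature (an illustrative stand-in for a closed-form rule;
    entropy is monotone increasing in tau for non-constant q)."""
    lo, hi = 1e-4, 1e4
    for _ in range(100):
        tau = 0.5 * (lo + hi)
        h = entropy(softmax_policy(q, tau))
        if abs(h - h_target) < tol:
            break
        if h < h_target:
            lo = tau                 # too greedy: raise temperature
        else:
            hi = tau                 # too uniform: lower temperature
    return softmax_policy(q, tau)

q_values = np.array([1.0, 0.3, -0.5, 0.1])        # toy Q(s, .)
mu = policy_with_target_entropy(q_values, h_target=0.8)
print(mu, entropy(mu))               # behavior policy with entropy ~0.8 nats
```

Because the behavior policies' entropy is disentangled from the target policy's, a mechanism of this general shape can be annealed progressively over training without constraining the target policy itself.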