Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) approaches that use previously gathered data can improve sample efficiency. However, off-policy learning becomes challenging when the discrepancy between the distribution of the policy of interest and the distributions of the policies that collected the data increases. Although the well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, they usually require collecting long trajectories, which increases computational complexity and induces additional problems such as vanishing or exploding gradients. Moreover, their generalization to continuous action domains is strictly limited, as they require action probabilities and are therefore unsuitable for deterministic policies. To overcome these limitations, we introduce an alternative off-policy correction algorithm for continuous action spaces, Actor-Critic Off-Policy Correction (AC-Off-POC), to mitigate the potential drawbacks introduced by previously collected data. Through a novel discrepancy measure computed from the agent's most recent action decisions on the states of a randomly sampled batch of transitions, the approach requires no actual or estimated action probabilities for any policy and offers an adequate one-step importance sampling. Theoretical results show that the introduced approach can achieve a contraction mapping with a unique fixed point, which enables "safe" off-policy learning. Our empirical results suggest that AC-Off-POC consistently improves on the state of the art and attains higher returns in fewer steps than competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.
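A minimal sketch of the idea the abstract describes may help fix intuition: the discrepancy between the current policy's action decisions on the sampled batch states and the actions stored in the batch is turned into a scalar factor that schedules the effective learning rate of the Q-learning and policy updates. The specific form used below (mean squared action distance mapped to a factor in (0, 1] via an exponential, with a `temperature` parameter) is an illustrative assumption, not the paper's exact formulation.

```python
# Hedged sketch of a discrepancy-based one-step off-policy correction.
# The discrepancy form and the name `off_policy_correction_factor` are
# assumptions for illustration only.
import numpy as np


def off_policy_correction_factor(policy, states, batch_actions, temperature=1.0):
    """Scale factor derived from the gap between the current policy's actions
    on the sampled states and the actions stored in the replay batch."""
    current_actions = policy(states)  # most recent action decisions on batch states
    discrepancy = np.mean(np.sum((current_actions - batch_actions) ** 2, axis=-1))
    # Map the discrepancy to (0, 1]: stale, dissimilar batches shrink the
    # effective learning rate instead of requiring action probabilities.
    return float(np.exp(-temperature * discrepancy))


# Toy usage with a deterministic linear-tanh policy on random transitions.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))  # state_dim=3, action_dim=2
policy = lambda s: np.tanh(s @ W)

states = rng.normal(size=(64, 3))
batch_actions = np.clip(policy(states) + 0.3 * rng.normal(size=(64, 2)), -1.0, 1.0)

base_lr = 1e-3
corrected_lr = base_lr * off_policy_correction_factor(policy, states, batch_actions)
print(f"base lr {base_lr:.1e} scaled to {corrected_lr:.2e} by the correction factor")
```

In this reading, the correction acts as a batch-level, one-step importance weight that needs only the deterministic actor's outputs, which is why no actual or estimated action probabilities are required.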