Actor-critic methods integrating target networks have exhibited stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing in the literature. In this paper, we reduce this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single-timescale temporal difference (TD) learning algorithm as a critic, we use a two-timescale target-based version of TD learning, closely inspired by practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.
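To make the three-timescale structure concrete, the following is a minimal sketch of a target-based actor-critic update loop with linear function approximation. It is not the paper's exact algorithm: the environment interface (reset/step), the feature maps feat_v and feat_sa, the softmax policy parameterization, and the particular step-size values and their ordering (fast critic, slower target network, slowest actor) are all illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """phi_sa: (n_actions, d_pi) state-action features; returns action probabilities."""
    logits = phi_sa @ theta
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def target_based_actor_critic(env, feat_v, feat_sa, d_v, d_pi, gamma=0.99,
                              alpha=1e-2, beta=1e-3, eta=1e-4,
                              n_steps=100_000, seed=0):
    """Three-timescale loop: critic (alpha), target network (beta), actor (eta)."""
    rng = np.random.default_rng(seed)
    omega = np.zeros(d_v)       # online critic weights (fastest timescale)
    bar_omega = np.zeros(d_v)   # target-network weights, slowly track the critic
    theta = np.zeros(d_pi)      # actor (policy) parameters (slowest timescale)

    s = env.reset()
    for _ in range(n_steps):
        phi_sa = feat_sa(s)                          # (n_actions, d_pi)
        probs = softmax_policy(theta, phi_sa)
        a = rng.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)

        phi, phi_next = feat_v(s), feat_v(s_next)
        # Target-based TD error: the bootstrap term uses the target weights.
        delta = r + gamma * (0.0 if done else phi_next @ bar_omega) - phi @ omega

        omega += alpha * delta * phi                 # fast critic update
        bar_omega += beta * (omega - bar_omega)      # slow Polyak-style target update
        grad_log = phi_sa[a] - probs @ phi_sa        # score function of the softmax policy
        theta += eta * delta * grad_log              # slowest actor update

        s = env.reset() if done else s_next
    return theta, omega, bar_omega
```

The sketch only illustrates the role of each timescale: the online critic chases a TD target that bootstraps from the frozen-then-slowly-updated target weights, while the actor performs a policy-gradient step driven by the critic's TD error.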