Actor-critic methods integrating target networks have exhibited stupendous empirical success in deep reinforcement learning. However, a theoretical understanding of the use of target networks in actor-critic methods is largely missing from the literature. In this paper, we bridge this gap between theory and practice by proposing the first theoretical analysis of an online target-based actor-critic algorithm with linear function approximation in the discounted reward setting. Our algorithm uses three different timescales: one for the actor and two for the critic. Instead of using the standard single-timescale temporal difference (TD) learning algorithm as a critic, we use a two-timescale target-based version of TD learning, closely inspired by practical actor-critic algorithms implementing target networks. First, we establish asymptotic convergence results for both the critic and the actor under Markovian sampling. Then, we provide a finite-time analysis showing the impact of incorporating a target network into actor-critic methods.
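To make the three-timescale structure concrete, here is a minimal sketch (not the authors' exact algorithm) of one online transition update with a linear critic, a target network, and a softmax policy. The step sizes `alpha`, `beta`, `eta`, the feature map `phi`, and the helper `grad_log_pi` are illustrative assumptions, not definitions from the paper.

```python
import numpy as np

def actor_critic_step(theta, omega, omega_target, phi, s, a, r, s_next, a_next,
                      gamma=0.99, alpha=1e-2, beta=1e-3, eta=1e-4):
    """One transition processed on three timescales (illustrative step sizes):
    alpha (fast)    -- critic weights omega chase the target-network Bellman target,
    beta  (slower)  -- target weights omega_target slowly track omega,
    eta   (slowest) -- actor parameters theta follow a policy-gradient estimate.
    phi(s, a) returns the linear feature vector of a state-action pair."""
    # TD error evaluated with the *target* critic for the next state-action pair,
    # as in target-based TD learning
    q_sa = phi(s, a) @ omega
    q_next = phi(s_next, a_next) @ omega_target
    delta = r + gamma * q_next - q_sa

    omega = omega + alpha * delta * phi(s, a)                      # fast critic update
    omega_target = omega_target + beta * (omega - omega_target)    # slow target tracking
    theta = theta + eta * delta * grad_log_pi(theta, phi, s, a)    # slowest actor update
    return theta, omega, omega_target

def grad_log_pi(theta, phi, s, a, actions=(0, 1)):
    """Score function of a softmax policy with linear preferences (assumed parameterisation)."""
    prefs = np.array([phi(s, b) @ theta for b in actions])
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return phi(s, a) - sum(p * phi(s, b) for p, b in zip(probs, actions))
```

The point of the sketch is the step-size ordering eta << beta << alpha: the critic moves fastest, the target network tracks it more slowly, and the actor moves slowest, which is the separation of timescales the analysis relies on.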