In this paper, we propose actor-director-critic, a new framework for deep reinforcement learning. Compared with the actor-critic framework, a director role is added, and action classification and action evaluation are applied simultaneously to improve the agent's decision-making performance. First, the agent's actions are divided into high-quality and low-quality actions according to the rewards returned by the environment. Then, the director network is trained to discriminate between high- and low-quality actions and to guide the actor network to reduce repetitive exploration of low-quality actions in the early stage of training. In addition, we propose an improved double estimator method to better address the overestimation problem in reinforcement learning. For each of the two critic networks used, we design two target critic networks instead of one, so that the target value of each critic network can be computed as the average of the outputs of its two target critic networks, which is more stable and accurate than using a single target critic network to obtain the target value. To verify the performance of the actor-director-critic framework and the improved double estimator method, we applied them to the TD3 algorithm. We then carried out experiments in multiple MuJoCo environments and compared the experimental results before and after the improvement. The results show that the improved algorithm achieves faster convergence and a higher total return.
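The action-classification step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the use of a simple reward threshold, and the binary labels are all assumptions made for exposition.

```python
import numpy as np

def label_actions(rewards, threshold):
    """Hypothetical sketch of the director's training labels:
    actions whose environment reward meets the threshold are marked
    high-quality (1); the rest are marked low-quality (0). The
    director network would then be trained as a binary classifier
    on these labels to steer the actor away from low-quality actions."""
    return (np.asarray(rewards, dtype=float) >= threshold).astype(int)

# Example: three experienced actions with rewards 0.1, 2.0, -1.0
# and an (assumed) quality threshold of 0.5.
labels = label_actions([0.1, 2.0, -1.0], threshold=0.5)
```

In practice the threshold would be chosen from statistics of recent rewards (for example, a running median), but the abstract does not specify the rule, so the constant here is purely illustrative.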
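The improved double estimator can be sketched numerically. Assuming the two averaged estimates are combined with the minimum operator, as in TD3's clipped double-Q update (the abstract states the averaging but not the combination rule, so the `min` here is an assumption), the target value computation looks like this:

```python
import numpy as np

def td_target(reward, gamma, critic1_target_outputs, critic2_target_outputs):
    """Sketch of the improved double estimator's target value.

    Each critic has TWO target networks; its target estimate is the
    average of their outputs, which is more stable than a single
    target network's output. The two averaged estimates are then
    combined (here, via min, following TD3's clipped double-Q;
    this combination rule is an assumption).
    """
    q1 = np.mean(critic1_target_outputs)  # average of critic 1's two target nets
    q2 = np.mean(critic2_target_outputs)  # average of critic 2's two target nets
    return reward + gamma * min(q1, q2)

# Example: reward 1.0, discount 0.99, and illustrative target-network outputs.
y = td_target(1.0, 0.99, [2.0, 4.0], [3.0, 5.0])
```

Averaging two independently lagged target networks reduces the variance of each critic's target, which is the stability gain the abstract claims over the single-target-network design.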