Despite the numerous applications and successes of deep reinforcement learning in control tasks, it still suffers from several crucial problems and limitations, including temporal credit assignment under sparse rewards, the absence of effective exploration, and brittle convergence that is extremely sensitive to hyperparameters. These difficulties of deep reinforcement learning in continuous control, together with the success of evolutionary algorithms in addressing some of them, gave rise to the idea of evolutionary reinforcement learning, which has attracted considerable debate. Despite successful results in a few studies in this field, a proper and fitting solution to these problems and limitations has yet to be presented. The present study investigates the efficiency of combining deep reinforcement learning with evolutionary computation and takes a step towards improving existing methods and addressing the remaining challenges. The proposed "Evolutionary Deep Reinforcement Learning Using Elite Buffer" algorithm introduces a novel mechanism inspired by interactive learning and hypothetical-outcome reasoning in the human brain. In this method, the elite buffer (inspired by learning through experience generalization in the human mind), together with crossover and mutation operators and interactive learning across successive generations, improves efficiency, convergence, and progress in continuous control. According to the experimental results, the proposed method surpasses other well-known methods in high-dimensional, complex environments and is superior in resolving the aforementioned problems and limitations.
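To make the role of the elite buffer concrete, the following is a minimal sketch of an evolutionary loop in which a persistent buffer of elite individuals seeds crossover and mutation in each generation. It is an illustrative assumption of the general scheme, not the paper's implementation: parameter vectors stand in for policy networks, the `fitness` function is a toy surrogate for an RL rollout score, and all names and sizes (`PARAM_DIM`, `POP_SIZE`, `ELITE_SIZE`, `GENERATIONS`) are hypothetical.

```python
import numpy as np

# Hypothetical sketch of an elite-buffer evolutionary loop.
# Parameter vectors stand in for policy networks; `fitness`
# stands in for episodic return from environment rollouts.

rng = np.random.default_rng(0)
PARAM_DIM, POP_SIZE, ELITE_SIZE, GENERATIONS = 16, 20, 5, 50

def fitness(theta):
    # Toy surrogate for a rollout score: higher is better.
    return -np.sum((theta - 1.0) ** 2)

def crossover(a, b):
    # Uniform crossover: each parameter comes from one parent at random.
    mask = rng.random(PARAM_DIM) < 0.5
    return np.where(mask, a, b)

def mutate(theta, sigma=0.1):
    # Additive Gaussian mutation of the parameter vector.
    return theta + sigma * rng.standard_normal(PARAM_DIM)

population = [rng.standard_normal(PARAM_DIM) for _ in range(POP_SIZE)]
elite_buffer = []  # persists across generations: best policies seen so far

for gen in range(GENERATIONS):
    scored = sorted(population, key=fitness, reverse=True)
    # Refresh the elite buffer with the best individuals found so far.
    elite_buffer = sorted(elite_buffer + scored[:ELITE_SIZE],
                          key=fitness, reverse=True)[:ELITE_SIZE]
    # Next generation: elites survive unchanged; the rest are produced by
    # crossover between a buffered elite and a population member, then mutated.
    children = [mutate(crossover(elite_buffer[rng.integers(ELITE_SIZE)],
                                 scored[rng.integers(POP_SIZE)]))
                for _ in range(POP_SIZE - ELITE_SIZE)]
    population = [e.copy() for e in elite_buffer] + children

print("best fitness:", fitness(elite_buffer[0]))
```

In this sketch, the buffer plays the "experience generalization" role described above: good solutions survive across generations and keep injecting their traits into offspring, rather than being lost when a single generation's population is replaced.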