Deep reinforcement learning was propelled by the advent of trust region methods, which are scalable and efficient. However, the pessimism of such algorithms, which force the policy to remain within a trust region at all costs, has been shown to suppress exploration and harm performance. Exploratory algorithms such as SAC use an entropy term to encourage exploration, yet implicitly optimize a different objective. We first observe this inconsistency and then propose an analogous augmentation technique that combines well with on-policy algorithms whenever a value critic is involved. Notably, the proposed method consistently satisfies the soft policy improvement theorem while remaining more extensible. As the analysis suggests, controlling the temperature coefficient is crucial for balancing exploration and exploitation. Empirical tests on MuJoCo benchmark tasks show that the agent is driven toward higher-reward regions and achieves better performance. Furthermore, we verify the exploration bonus of our method on a set of custom environments.
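To make the entropy-augmentation idea concrete, below is a minimal, hypothetical sketch (not the paper's exact formulation) of how a per-step entropy bonus weighted by a temperature coefficient can augment the rewards used by an on-policy actor-critic with a value critic. The names `alpha`, `entropy_augmented_advantages`, and the toy rollout values are illustrative assumptions.

```python
import numpy as np

def entropy_augmented_advantages(rewards, values, entropies, alpha=0.01, gamma=0.99):
    """Augment per-step rewards with alpha * entropy, then compute
    discounted returns and advantages against the value critic's baseline."""
    aug_rewards = np.asarray(rewards, dtype=np.float64) + alpha * np.asarray(entropies, dtype=np.float64)
    returns = np.zeros_like(aug_rewards)
    running = 0.0
    for t in reversed(range(len(aug_rewards))):
        running = aug_rewards[t] + gamma * running  # discounted entropy-augmented return
        returns[t] = running
    advantages = returns - np.asarray(values, dtype=np.float64)  # critic as baseline
    return returns, advantages

# Toy usage on a 4-step rollout: a larger alpha emphasizes exploration,
# a smaller alpha emphasizes exploitation.
returns, adv = entropy_augmented_advantages(
    rewards=[1.0, 0.0, 0.5, 1.0],
    values=[0.8, 0.6, 0.9, 1.1],
    entropies=[1.2, 1.1, 0.9, 0.7],
    alpha=0.05,
)
print(returns, adv)
```

In this sketch the temperature `alpha` plays the balancing role described above: it scales how much the exploration bonus contributes to the advantages that drive the policy update.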