In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the maximum entropy RL framework in model-free sample-based learning. Whereas the maximum entropy RL framework guides learning for policies to reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvement over the current state-of-the-art RL algorithms.
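For reference, the maximum entropy RL objective that the proposed framework contrasts with is typically written as below (a background sketch in standard notation with discount factor $\gamma$ and temperature $\alpha$; this is not the paper's max-min objective, whose exact formulation is given in the body of the paper):
\[
J_{\mathrm{MaxEnt}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;+\; \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big)\right],
\]
where $\mathcal{H}(\pi(\cdot \mid s_t))$ is the entropy of the policy at state $s_t$. The proposed max-min framework instead directs the policy toward states where this entropy is low and then maximizes the entropy at those states.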