Maximum Entropy (MaxEnt) reinforcement learning is a powerful learning paradigm that seeks to maximize return under entropy regularization. However, action entropy does not necessarily coincide with state entropy, e.g., when multiple actions produce the same transition. Instead, we propose to maximize the transition entropy, i.e., the entropy of next states. We show that transition entropy can be decomposed into two terms: model-dependent transition entropy and action redundancy. In particular, we explore the latter in both deterministic and stochastic settings and develop tractable approximation methods in a near model-free setup. We construct algorithms to minimize action redundancy and demonstrate their effectiveness on a synthetic environment with multiple redundant actions, as well as on contemporary benchmarks in Atari and MuJoCo. Our results suggest that action redundancy is a fundamental problem in reinforcement learning.
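As a minimal sketch of the kind of decomposition referred to above (notation assumed here, not taken from the paper): applying the chain rule of entropy to the joint distribution of the action $A \sim \pi(\cdot \mid s)$ and next state $S'$ at a state $s$ gives
\[
\mathcal{H}\big(S' \mid s\big) \;=\; \underbrace{\mathcal{H}\big(A \mid s\big)}_{\text{action entropy}} \;+\; \underbrace{\mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\mathcal{H}(S' \mid s, a)\big]}_{\text{model-dependent transition entropy}} \;-\; \underbrace{\mathcal{H}\big(A \mid s, S'\big)}_{\text{action redundancy}} ,
\]
where the last term is large when many actions induce the same next state. In a deterministic environment the middle term vanishes, so transition entropy equals action entropy minus this redundancy term, which is consistent with the claim that maximizing action entropy alone can overvalue redundant actions.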