A novel Policy Gradient (PG) algorithm, called Matryoshka Policy Gradient (MPG), is introduced and studied in the context of max-entropy reinforcement learning, where an agent aims to maximise entropy bonuses in addition to its cumulative rewards. MPG differs from standard PG in that it trains a sequence of policies to learn finite-horizon tasks simultaneously, rather than a single policy for the single standard objective. For softmax policies, we prove convergence of MPG and global optimality of the limit by showing that the only critical point of the MPG objective is the optimal policy; these results hold even for continuous compact state spaces. MPG is intuitive and theoretically sound, and we further show that the optimal policy of the standard max-entropy objective can be approximated arbitrarily well by the optimal policy of the MPG framework. Finally, we argue that MPG is well suited to policies parametrized by neural networks, and we provide a simple criterion to verify the global optimality of the policy at convergence. As a proof of concept, we evaluate MPG numerically on standard test benchmarks.
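For orientation, one way to write the standard max-entropy objective and a nested finite-horizon family of the kind the abstract alludes to is sketched below; the temperature $\tau$, discount $\gamma$, and the precise indexing of the nested policies are illustrative assumptions and need not match the paper's exact formulation.

$$J_{\mathrm{maxent}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t\ge 0}\gamma^{t}\Bigl(r(s_t,a_t)+\tau\,\mathcal{H}\bigl(\pi(\cdot\mid s_t)\bigr)\Bigr)\right],$$

$$J_{h}\bigl(\pi^{(h)},\dots,\pi^{(1)}\bigr) \;=\; \mathbb{E}\!\left[\sum_{t=0}^{h-1}\Bigl(r(s_t,a_t)+\tau\,\mathcal{H}\bigl(\pi^{(h-t)}(\cdot\mid s_t)\bigr)\Bigr)\right],\qquad a_t\sim\pi^{(h-t)}(\cdot\mid s_t).$$

In this illustrative "Matryoshka" structure, the horizon-$h$ policy acts first and then hands control to the horizon-$(h-1)$ policy, so the whole family $(\pi^{(1)},\dots,\pi^{(H)})$ is trained jointly, one nested finite-horizon objective per policy, rather than a single policy for the single infinite-horizon objective.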