This is a short comment on the paper "Asymptotically Stable Adaptive-Optimal Control Algorithm With Saturating Actuators and Relaxed Persistence of Excitation" by Vamvoudakis et al. The question of stability of reinforcement learning (RL) agents remains hard, and the cited work suggested an on-policy approach with a suitable stability property using a technique from adaptive control: a robustifying term added to the action. However, there is an issue with this approach to stabilizing RL, which we explain in this note. Furthermore, Vamvoudakis et al. appear to have made a fallacious assumption on the Hamiltonian under a generic policy. To provide a positive result, we not only point out this mistake but also show convergence of the critic neural network weights in a stochastic, continuous-time environment, provided certain conditions on the behavior policy hold.
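For context, a minimal sketch of the Hamiltonian in the standard continuous-time optimal control setting may help (the notation is ours and is chosen for illustration; dynamics $\dot{x} = f(x) + g(x)u$ and running cost $r(x, u)$ are assumptions, not taken from the commented paper):
\[
H\bigl(x, u, \nabla V(x)\bigr) = \nabla V(x)^{\top}\bigl(f(x) + g(x)u\bigr) + r(x, u).
\]
The Hamilton-Jacobi-Bellman equation states $\min_{u} H\bigl(x, u, \nabla V^{*}(x)\bigr) = 0$ for the optimal value function $V^{*}$, so the Hamiltonian vanishes along the optimal policy; under a generic (e.g., behavior) policy, however, $H$ need not vanish, which is precisely the assumption at issue.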