Achieving provable stability in model-free reinforcement learning (RL) remains a challenge, particularly in balancing exploration with rigorous safety. This article introduces MSACL, a framework that integrates exponential stability theory with maximum entropy RL through multi-step Lyapunov certificate learning. Unlike methods relying on complex reward engineering, MSACL utilizes off-policy multi-step data to learn Lyapunov certificates satisfying theoretical stability conditions. By introducing Exponential Stability Labels (ESL) and a $\lambda$-weighted aggregation mechanism, the framework effectively balances the bias-variance trade-off in multi-step learning. Policy optimization is guided by a stability-aware advantage function, ensuring the learned policy promotes rapid Lyapunov descent. We evaluate MSACL across six benchmarks, including stabilization and nonlinear tracking tasks, demonstrating its superiority over state-of-the-art Lyapunov-based RL algorithms. MSACL achieves exponential stability and rapid convergence under simple rewards, while exhibiting significant robustness to uncertainties and generalization to unseen trajectories. Sensitivity analysis establishes the multi-step horizon $n=20$ as a robust default across diverse systems. By linking Lyapunov theory with off-policy actor-critic frameworks, MSACL provides a foundation for verifiably safe learning-based control. Source code and benchmark environments will be made publicly available.
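To make the $\lambda$-weighted multi-step aggregation concrete, the sketch below shows one way such a target could be formed for a Lyapunov critic. It is not the paper's implementation: the function name, the per-step cost accumulation, the decay rate `alpha`, and the truncated TD($\lambda$)-style weighting are all illustrative assumptions, intended only to show how the weighting trades off bias (short horizons) against variance (long horizons) up to the horizon $n$.

```python
import numpy as np

def lambda_weighted_lyapunov_target(l_values, costs, lam=0.9, alpha=0.05, n=20):
    """Illustrative lambda-weighted aggregation of multi-step Lyapunov targets.

    l_values : Lyapunov critic values L(s_t), ..., L(s_{t+n})   (length n + 1)
    costs    : per-step costs c_t, ..., c_{t+n-1}                (length n)
    lam      : geometric weighting factor controlling the bias-variance trade-off
    alpha    : assumed exponential decay rate in the stability condition
    n        : multi-step horizon (the abstract reports n = 20 as a robust default)
    """
    targets = []
    for k in range(1, n + 1):
        # k-step target: accumulated cost plus a decayed bootstrap of the Lyapunov critic
        g_k = sum(costs[:k]) + (1.0 - alpha) ** k * l_values[k]
        targets.append(g_k)
    # Truncated TD(lambda)-style weights: (1 - lam) * lam^(k-1), with the remaining
    # tail mass assigned to the full n-step target so the weights sum to one.
    weights = np.array([(1.0 - lam) * lam ** (k - 1) for k in range(1, n + 1)])
    weights[-1] += lam ** n
    return float(np.dot(weights, targets))
```

In this hypothetical form, small `lam` emphasizes short-horizon (low-variance, higher-bias) targets, while `lam` close to one leans on the full n-step return; the actual aggregation used by MSACL is defined in the body of the paper.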