Reinforcement learning algorithms solve sequential decision-making problems in probabilistic environments by optimizing for long-term reward. The desire to use reinforcement learning in safety-critical settings has inspired a recent line of work on formally constrained reinforcement learning; however, these methods place the implementation of the learning algorithm in their Trusted Computing Base. The crucial correctness property of such an implementation is a guarantee that the learning algorithm converges to an optimal policy. This paper begins the work of closing this gap by developing a Coq formalization of two canonical reinforcement learning algorithms: value and policy iteration for finite-state Markov decision processes. The central results are a formalization of Bellman's optimality principle and its proof, which uses the contraction property of the Bellman optimality operator to establish that the sequence of value iterates converges in the infinite-horizon limit. The CertRL development exemplifies how the Giry monad and mechanized metric coinduction streamline optimality proofs for reinforcement learning algorithms. The CertRL library provides a general framework for proving properties of Markov decision processes and reinforcement learning algorithms, paving the way for further formalization work in reinforcement learning.
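For context on the contraction argument mentioned above, the following is a sketch of the standard textbook definitions (not taken from the CertRL development itself): the Bellman optimality operator on value functions for a finite-state MDP, its γ-contraction in the sup-norm, and the resulting convergence of value iteration via the Banach fixed-point theorem.

```latex
% Bellman optimality operator T on value functions V : S -> R,
% for a finite-state MDP with reward r, transition kernel P,
% and discount factor gamma in [0, 1).
\[
  (T V)(s) \;=\; \max_{a \in A} \Big( r(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V(s') \Big)
\]
% T is a gamma-contraction in the sup-norm:
\[
  \lVert T V_1 - T V_2 \rVert_\infty \;\le\; \gamma\, \lVert V_1 - V_2 \rVert_\infty
\]
% Hence, by the Banach fixed-point theorem, the value iteration sequence
% V_{k+1} := T V_k converges to the unique fixed point V^* as k -> infinity,
% i.e. in the infinite-horizon limit.
```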