Value-based methods play a fundamental role in Markov decision processes (MDPs) and reinforcement learning (RL). In this paper, we present a unified control-theoretic framework for analyzing value-based methods such as value computation (VC), value iteration (VI), and temporal difference (TD) learning (with linear function approximation). Building on an intrinsic connection between value-based methods and dynamical systems, we can directly apply existing convex testing conditions from control theory to derive various convergence results for the aforementioned value-based methods. These testing conditions are convex programs in the form of either linear programs (LPs) or semidefinite programs (SDPs), and can be solved to construct Lyapunov functions in a straightforward manner. Our analysis reveals some intriguing connections between feedback control systems and RL algorithms. It is our hope that such connections can inspire more work at the intersection of systems/control theory and RL.
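To make the connection concrete, the following is a minimal sketch (not the paper's implementation) of one instance of the idea: value computation (policy evaluation) V_{k+1} = r + gamma * P_pi V_k is a linear dynamical system whose error dynamics have state matrix A = gamma * P_pi, and its convergence can be certified by solving the discrete-time Lyapunov inequality A^T X A - X < 0, X > 0 as a small SDP. The function name certify_value_computation, the random 4-state MDP, and the use of cvxpy are illustrative assumptions, not part of the original text.

# Minimal sketch: certify convergence of policy evaluation via an SDP
# (discrete-time Lyapunov inequality), assuming a tabular MDP and a fixed policy.
import numpy as np
import cvxpy as cp

def certify_value_computation(P_pi, gamma, margin=1e-6):
    """Search for a quadratic Lyapunov function L(e) = e^T X e for the
    evaluation error e_k = V_k - V^pi.  Returns X if the SDP is feasible."""
    n = P_pi.shape[0]
    A = gamma * P_pi                                 # error dynamics: e_{k+1} = A e_k
    X = cp.Variable((n, n), symmetric=True)
    constraints = [
        X >> margin * np.eye(n),                     # X positive definite
        A.T @ X @ A - X << -margin * np.eye(n),      # strict Lyapunov decrease
    ]
    prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
    prob.solve()
    return X.value if prob.status == cp.OPTIMAL else None

# Usage: a random 4-state transition matrix under a fixed policy (illustrative).
rng = np.random.default_rng(0)
P_pi = rng.random((4, 4))
P_pi /= P_pi.sum(axis=1, keepdims=True)              # make each row a distribution
X = certify_value_computation(P_pi, gamma=0.9)
print("Lyapunov certificate found" if X is not None else "SDP infeasible")

Since gamma * P_pi has spectral radius gamma < 1, this SDP is always feasible in the tabular policy-evaluation case; the same template, with different state matrices, is what the abstract refers to for the other value-based methods.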