A Theoretical Connection Between Statistical Physics and Reinforcement Learning

Sequential decision making in the presence of uncertainty and stochastic dynamics gives rise to distributions over state-action trajectories in reinforcement learning (RL) and optimal control problems. This observation has led to a variety of connections between RL and inference in probabilistic graphical models (PGMs). Here we explore a different dimension of this relationship, examining reinforcement learning using the tools and abstractions of statistical physics. The central object in the statistical physics abstraction is the partition function $\mathcal{Z}$, and here we construct a partition function from the ensemble of possible trajectories that an agent might take in a Markov decision process (MDP). Although value functions and $Q$-functions can be derived from this partition function and interpreted via average energies, the $\mathcal{Z}$-function provides an object with its own Bellman equation that can form the basis of alternative dynamic programming approaches. Moreover, when the MDP dynamics are deterministic, the Bellman equation for $\mathcal{Z}$ is linear, allowing direct solutions that are unavailable for the nonlinear equations associated with traditional value functions. The policies learned via these $\mathcal{Z}$-based Bellman updates are tightly linked to Boltzmann-like policy parameterizations. In addition to sampling actions proportionally to the exponential of the expected cumulative reward, as Boltzmann policies would, these policies take entropy into account, favoring states from which many outcomes are possible.
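
To make the linearity claim concrete, the following is a minimal sketch, not the authors' implementation. It assumes that $\mathcal{Z}(s)$ is the sum of $\exp(\beta \cdot \text{cumulative reward})$ over all trajectories starting at $s$, so that under deterministic dynamics it satisfies $\mathcal{Z}(s) = \sum_a \exp(\beta\, r(s,a))\, \mathcal{Z}(s')$ with $\mathcal{Z} = 1$ at terminal states. The inverse temperature `beta`, the four-state toy MDP, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def solve_partition_function(transitions, terminal, n_states, beta):
    """Solve the (assumed) linear Bellman equation Z = M Z + b directly.

    transitions[s] is a list of (reward, next_state) pairs, one per action.
    M[s, s'] accumulates exp(beta * r) over actions leading from s to s';
    b is 1 at terminal states, where Z is fixed to 1.
    """
    M = np.zeros((n_states, n_states))
    for s, moves in transitions.items():
        for reward, s_next in moves:
            M[s, s_next] += np.exp(beta * reward)
    b = np.zeros(n_states)
    b[list(terminal)] = 1.0
    return np.linalg.solve(np.eye(n_states) - M, b)

def trajectory_returns(transitions, terminal, s, acc=0.0):
    """Enumerate the cumulative reward of every trajectory starting at s."""
    if s in terminal:
        yield acc
        return
    for reward, s_next in transitions[s]:
        yield from trajectory_returns(transitions, terminal, s_next, acc + reward)

# Hypothetical deterministic, acyclic toy MDP: state 3 is terminal.
transitions = {
    0: [(1.0, 1), (0.0, 2)],
    1: [(0.5, 3), (-1.0, 2)],
    2: [(2.0, 3)],
}
terminal = {3}
beta = 1.0

Z = solve_partition_function(transitions, terminal, n_states=4, beta=beta)

# Cross-check against the definition: sum over all trajectories from state 0.
Z_brute = sum(np.exp(beta * g) for g in trajectory_returns(transitions, terminal, 0))

# Boltzmann-like policy at state 0: weight each action by exp(beta * r) * Z(s').
weights = np.array([np.exp(beta * r) * Z[s_next] for r, s_next in transitions[0]])
policy_at_0 = weights / weights.sum()

print("Z:", Z)
print("Z(0) by enumeration:", Z_brute)
print("policy at state 0:", policy_at_0)
```

In this toy problem the first action at state 0 leads to a state with two onward trajectories, so its weight reflects both the rewards along those trajectories and how many of them there are. This is one way to read the abstract's remark that the resulting policies favor states from which many outcomes are possible.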
