We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2021) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately. The first type is based directly on the aforementioned representation, which involves future trajectories and hence is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions, which are then incorporated via stochastic approximation when updating policies. Finally, we demonstrate the algorithms with simulations of two concrete examples.
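To make the online scheme described above concrete, the following is a minimal illustrative sketch of an online actor-critic loop of the kind the abstract outlines: a temporal-difference (martingale-increment) term drives both a critic update (one choice of test function in a martingale orthogonality condition) and an actor update via stochastic approximation. The 1-D controlled diffusion, quadratic reward, Gaussian policy, critic parameterization, and step sizes are all hypothetical choices made for illustration; this is not the paper's exact algorithm or either of its examples.

# Illustrative online actor-critic sketch (hypothetical model and parameterizations).
import numpy as np

rng = np.random.default_rng(0)

T, dt = 1.0, 0.01          # horizon and discretization step
sigma = 0.5                # diffusion coefficient of dX = a dt + sigma dW
beta = 0.1                 # discount rate
gamma = 0.1                # entropy-regularization temperature

def reward(x, a):
    # hypothetical running reward
    return -(x ** 2 + a ** 2)

# Critic J_phi(t, x) = phi0 + phi1 * (T - t) + phi2 * x^2 (hypothetical form).
phi = np.zeros(3)

def J(phi, t, x):
    return phi[0] + phi[1] * (T - t) + phi[2] * x ** 2

def grad_J(phi, t, x):
    return np.array([1.0, T - t, x ** 2])

# Actor: Gaussian policy a ~ N(theta0 * x, exp(theta1)^2) (hypothetical form).
theta = np.array([0.0, np.log(0.5)])

def sample_action(theta, x):
    return theta[0] * x + np.exp(theta[1]) * rng.standard_normal()

def log_pi(theta, x, a):
    mu, s = theta[0] * x, np.exp(theta[1])
    return -0.5 * ((a - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

def grad_log_pi(theta, x, a):
    mu, s = theta[0] * x, np.exp(theta[1])
    z = (a - mu) / s
    return np.array([z / s * x, z ** 2 - 1.0])

alpha_phi, alpha_theta = 0.1, 0.01   # step sizes (hypothetical)

for episode in range(1000):
    x, t = rng.standard_normal(), 0.0
    while t < T - 1e-9:
        a = sample_action(theta, x)
        x_next = x + a * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        # Temporal-difference / martingale-increment term:
        # dJ + (r - gamma * log pi - beta * J) dt
        delta = (J(phi, t + dt, x_next) - J(phi, t, x)
                 + (reward(x, a) - gamma * log_pi(theta, x, a)
                    - beta * J(phi, t, x)) * dt)
        # Critic update: test function xi = grad_phi J, one choice among the
        # martingale orthogonality conditions.
        phi += alpha_phi * grad_J(phi, t, x) * delta
        # Actor update: stochastic approximation of the first-order condition
        # of the policy gradient.
        theta += alpha_theta * grad_log_pi(theta, x, a) * delta
        x, t = x_next, t + dt

The same building blocks (simulated increments of the state, the critic's martingale increment, and the score function of the policy) also suffice for an offline variant, in which whole trajectories are generated first and the expected-integral representation of the gradient is estimated by Monte Carlo before updating the policy.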