We study policy gradient (PG) for reinforcement learning in continuous time and space under the regularized exploratory formulation developed by Wang et al. (2020). We represent the gradient of the value function with respect to a given parameterized stochastic policy as the expected integral of an auxiliary running reward function that can be evaluated using samples and the current value function. This effectively turns PG into a policy evaluation (PE) problem, enabling us to apply the martingale approach recently developed by Jia and Zhou (2021) for PE to solve our PG problem. Based on this analysis, we propose two types of actor-critic algorithms for RL, in which value functions and policies are learned and updated simultaneously and alternately. The first type is based directly on the aforementioned representation, which involves future trajectories and hence is offline. The second type, designed for online learning, employs the first-order condition of the policy gradient and turns it into martingale orthogonality conditions, which are then incorporated via stochastic approximation when updating policies. Finally, we demonstrate the algorithms with simulations of two concrete examples.
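To make the online scheme described above concrete, the following is a minimal illustrative sketch of an online actor-critic loop of the kind the abstract outlines: a temporal-difference (martingale-increment) term drives both a critic update (one choice of test function in a martingale orthogonality condition) and an actor update via stochastic approximation. The 1-D controlled diffusion, quadratic reward, Gaussian policy, critic parameterization, and step sizes are all hypothetical choices made for illustration; this is not the paper's exact algorithm or either of its examples.

# Illustrative online actor-critic sketch (hypothetical model and parameterizations).
import numpy as np

rng = np.random.default_rng(0)

T, dt = 1.0, 0.01          # horizon and discretization step
sigma = 0.5                # diffusion coefficient of dX = a dt + sigma dW
beta = 0.1                 # discount rate
gamma = 0.1                # entropy-regularization temperature

def reward(x, a):
    # hypothetical running reward
    return -(x ** 2 + a ** 2)

# Critic J_phi(t, x) = phi0 + phi1 * (T - t) + phi2 * x^2 (hypothetical form).
phi = np.zeros(3)

def J(phi, t, x):
    return phi[0] + phi[1] * (T - t) + phi[2] * x ** 2

def grad_J(phi, t, x):
    return np.array([1.0, T - t, x ** 2])

# Actor: Gaussian policy a ~ N(theta0 * x, exp(theta1)^2) (hypothetical form).
theta = np.array([0.0, np.log(0.5)])

def sample_action(theta, x):
    return theta[0] * x + np.exp(theta[1]) * rng.standard_normal()

def log_pi(theta, x, a):
    mu, s = theta[0] * x, np.exp(theta[1])
    return -0.5 * ((a - mu) / s) ** 2 - np.log(s) - 0.5 * np.log(2 * np.pi)

def grad_log_pi(theta, x, a):
    mu, s = theta[0] * x, np.exp(theta[1])
    z = (a - mu) / s
    return np.array([z / s * x, z ** 2 - 1.0])

alpha_phi, alpha_theta = 0.1, 0.01   # step sizes (hypothetical)

for episode in range(1000):
    x, t = rng.standard_normal(), 0.0
    while t < T - 1e-9:
        a = sample_action(theta, x)
        x_next = x + a * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        # Temporal-difference / martingale-increment term:
        # dJ + (r - gamma * log pi - beta * J) dt
        delta = (J(phi, t + dt, x_next) - J(phi, t, x)
                 + (reward(x, a) - gamma * log_pi(theta, x, a)
                    - beta * J(phi, t, x)) * dt)
        # Critic update: test function xi = grad_phi J, one choice among the
        # martingale orthogonality conditions.
        phi += alpha_phi * grad_J(phi, t, x) * delta
        # Actor update: stochastic approximation of the first-order condition
        # of the policy gradient.
        theta += alpha_theta * grad_log_pi(theta, x, a) * delta
        x, t = x_next, t + dt

The same building blocks (simulated increments of the state, the critic's martingale increment, and the score function of the policy) also suffice for an offline variant, in which whole trajectories are generated first and the expected-integral representation of the gradient is estimated by Monte Carlo before updating the policy.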