We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous time, we show that it is possible to construct a more efficient and accurate gradient estimator. The standard back-propagation through time (BPTT) estimator computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate the continuous-time gradients of the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator, which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with more efficient CTPG estimates results in faster and more robust learning across a variety of control tasks and simulators.
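The contrast between the two estimators can be made concrete with a small sketch. The snippet below is illustrative only and is not the paper's CTPG implementation: it assumes a double-integrator system with a linear state-feedback policy and a quadratic running cost, and it uses the adjoint-based gradient of JAX's adaptive `odeint` as a stand-in for a continuous-time gradient, compared against back-propagation through a fixed-step Euler rollout (BPTT).

```python
# Illustrative sketch (assumed dynamics, policy, and cost; not the paper's CTPG code).
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint

A = jnp.array([[0.0, 1.0], [0.0, 0.0]])   # known double-integrator dynamics
B = jnp.array([[0.0], [1.0]])
x0 = jnp.array([1.0, 0.0])
T, n_steps = 2.0, 200                      # horizon and fixed step count for BPTT

def dynamics(x, theta):
    u = theta @ x                          # linear state-feedback policy
    return A @ x + B @ u

def bptt_cost(theta):
    """Exact gradient of a crude (fixed-step Euler) discretization."""
    dt = T / n_steps
    def step(x, _):
        x = x + dt * dynamics(x, theta)
        return x, jnp.sum(x ** 2) * dt     # running quadratic state cost
    _, costs = jax.lax.scan(step, x0, None, length=n_steps)
    return costs.sum()

def ctpg_like_cost(theta):
    """Continuous-time objective via an adaptive solver; its gradient is
    obtained by solving an adjoint ODE, also with adaptive steps."""
    def aug_dynamics(aug, t, theta):
        x = aug[:-1]                       # last entry accumulates the cost
        return jnp.concatenate([dynamics(x, theta), jnp.sum(x ** 2)[None]])
    aug0 = jnp.concatenate([x0, jnp.zeros(1)])
    sol = odeint(aug_dynamics, aug0, jnp.array([0.0, T]), theta,
                 rtol=1e-6, atol=1e-6)
    return sol[-1, -1]                     # accumulated cost at time T

theta = jnp.zeros((1, 2))
g_bptt = jax.grad(bptt_cost)(theta)        # back-prop through the unrolled loop
g_ct = jax.grad(ctpg_like_cost)(theta)     # adjoint-ODE gradient, adaptive steps
print(g_bptt, g_ct)
```

The point of the comparison is that the first gradient is exact for the discretized problem but inherits its fixed, possibly crude time grid, while the second targets the continuous-time gradient directly and lets the solver choose step sizes adaptively.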