In this paper, we introduce a novel online model-based reinforcement learning algorithm that uses the Unscented Transform to propagate uncertainty for the prediction of future reward. Previous approaches either approximate the state distribution at each step of the prediction horizon with a Gaussian, or perform Monte Carlo simulations to estimate the rewards. Depending on the number of sigma points employed, our method can propagate either the mean and covariance with a minimal set of points, or higher-order moments with more points, similarly to Monte Carlo. The whole framework is implemented as a computational graph for online training. Furthermore, to prevent an explosion in the number of sigma points when propagating through a generic state-dependent uncertainty model, we add sigma-point expansion and contraction layers to our graph, designed using the principle of moment matching. Finally, we propose a gradient-descent step inspired by Sequential Quadratic Programming for updating the policy parameters in the presence of state constraints. We demonstrate the proposed method with two applications in simulation. The first designs a stabilizing controller for the cart-pole problem when the dynamics are known with state-dependent uncertainty. The second example, following up on our previous work, tunes the parameters of a control barrier function-based Quadratic Programming controller for a leader-follower problem in the presence of input constraints.
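Since the abstract centers on propagating uncertainty with the Unscented Transform and moment matching, a minimal sketch may help fix ideas. The snippet below implements the standard scaled Unscented Transform in Python/NumPy; it is an illustration, not the paper's implementation, and the dynamics map `f`, the scaling parameters `alpha`, `beta`, `kappa`, and all names are assumptions chosen for the example.

```python
import numpy as np

def unscented_transform(mean, cov, f, alpha=1.0, beta=2.0, kappa=1.0):
    """Propagate (mean, cov) through a nonlinear map f via sigma points."""
    n = mean.shape[0]
    lam = alpha**2 * (n + kappa) - n

    # 2n + 1 symmetric sigma points: the mean plus/minus scaled
    # columns of a matrix square root of the covariance.
    L = np.linalg.cholesky((n + lam) * cov)
    sigma = np.vstack([mean[None, :], mean + L.T, mean - L.T])

    # Weights chosen so the sigma points reproduce the input mean and
    # covariance exactly (moment matching).
    w_m = np.full(2 * n + 1, 0.5 / (n + lam))
    w_c = w_m.copy()
    w_m[0] = lam / (n + lam)
    w_c[0] = w_m[0] + (1.0 - alpha**2 + beta)

    # Push each sigma point through the nonlinearity.
    y = np.array([f(s) for s in sigma])

    # Recover the output moments by weighted moment matching.
    y_mean = w_m @ y
    d = y - y_mean
    y_cov = (w_c * d.T) @ d
    return y_mean, y_cov

# Illustrative use: one step of pendulum-like dynamics (hypothetical model).
f = lambda x: np.array([x[0] + 0.1 * x[1], x[1] - 0.1 * np.sin(x[0])])
m, P = unscented_transform(np.array([0.5, 0.0]), 0.01 * np.eye(2), f)
```

In the same spirit, the expansion and contraction layers described in the abstract would grow this sigma-point set to capture higher-order moments and shrink it back toward the minimal 2n + 1 points, with the weights in both directions fixed by the same moment-matching principle.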