This paper addresses the problem of learning the optimal feedback policy for a nonlinear stochastic dynamical system. Feedback policies typically require a high-dimensional parametrization, which makes Reinforcement Learning (RL) algorithms that search for an optimum in this large parameter space sample-inefficient and subject to high variance. We propose a "decoupling" principle that drastically reduces the feedback parameter space while remaining locally optimal. A corollary of this result is a decoupled data-based control (D2C) algorithm for RL: first, an open-loop deterministic trajectory optimization problem is solved using a black-box simulation model of the dynamical system; then, a linear closed-loop control law is designed around this nominal trajectory using the same simulation model. Empirical evidence suggests a highly significant reduction in both training time and training variance, without compromising performance, compared to state-of-the-art RL algorithms.
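To make the two-step structure of D2C concrete, the following is a minimal sketch of the idea, not the authors' implementation: a black-box one-step simulator `step(x, u)` (here a toy pendulum stands in for the unknown system), finite-difference gradient descent for the open-loop trajectory optimization, and a finite-difference linearization with a time-varying LQR recursion for the linear feedback around the nominal trajectory. All function names, cost weights, and the pendulum model are illustrative assumptions.

```python
# Sketch of the two-step D2C idea (illustrative only, not the authors' code).
import numpy as np

def step(x, u, dt=0.05):
    """Black-box one-step simulator (toy pendulum standing in for any system)."""
    theta, omega = x
    omega_dot = -9.81 * np.sin(theta) + u[0]
    return np.array([theta + dt * omega, omega + dt * omega_dot])

def rollout_cost(x0, U, x_goal):
    """Finite-horizon cost of an open-loop control sequence U."""
    x, cost = x0, 0.0
    for u in U:
        cost += 0.01 * u @ u
        x = step(x, u)
    return cost + 10.0 * np.sum((x - x_goal) ** 2)

# Step 1: open-loop deterministic trajectory optimization via the simulator
# (finite-difference gradient descent on the control sequence).
def optimize_open_loop(x0, x_goal, T=60, iters=300, lr=0.5, eps=1e-4):
    U = np.zeros((T, 1))
    for _ in range(iters):
        base = rollout_cost(x0, U, x_goal)
        grad = np.zeros_like(U)
        for t in range(T):
            Up = U.copy(); Up[t, 0] += eps
            grad[t, 0] = (rollout_cost(x0, Up, x_goal) - base) / eps
        U -= lr * grad
    return U

# Step 2: linear closed-loop control around the nominal trajectory,
# obtained by finite-difference linearization and a time-varying LQR recursion.
def design_feedback(x0, U, eps=1e-4):
    n, m = x0.size, U.shape[1]
    Q, R = np.eye(n), 0.01 * np.eye(m)
    X = [x0]
    for u in U:                      # nominal trajectory
        X.append(step(X[-1], u))
    A, B = [], []
    for t in range(len(U)):          # linearize simulator around (X[t], U[t])
        At, Bt = np.zeros((n, n)), np.zeros((n, m))
        for i in range(n):
            dx = np.zeros(n); dx[i] = eps
            At[:, i] = (step(X[t] + dx, U[t]) - step(X[t], U[t])) / eps
        for j in range(m):
            du = np.zeros(m); du[j] = eps
            Bt[:, j] = (step(X[t], U[t] + du) - step(X[t], U[t])) / eps
        A.append(At); B.append(Bt)
    P, K = Q.copy(), [None] * len(U)
    for t in reversed(range(len(U))):  # backward Riccati recursion for gains
        K[t] = np.linalg.solve(R + B[t].T @ P @ B[t], B[t].T @ P @ A[t])
        P = Q + A[t].T @ P @ (A[t] - B[t] @ K[t])
    return X, K

# Closed-loop policy applied online: u_t = U[t] - K[t] @ (x_t - X[t])
x0, x_goal = np.array([np.pi, 0.0]), np.array([0.0, 0.0])
U = optimize_open_loop(x0, x_goal)
X_nom, K = design_feedback(x0, U)
```

The resulting policy is parametrized only by the nominal sequence and the time-varying gains, which illustrates the drastic reduction in feedback parameter space relative to a generic high-dimensional policy representation.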