We study the problem of generating control laws for systems with unknown dynamics. Our approach is to represent both the controller and the value function with neural networks, and to train them using loss functions adapted from the Hamilton-Jacobi-Bellman (HJB) equations. In the absence of a known dynamics model, our method first learns the state transitions from data collected by interacting with the system in an offline process. The learned transition function is then integrated into the HJB equations and used to forward-simulate the control signals produced by our controller in a feedback loop. In contrast to trajectory optimization methods, which optimize the controller for a single initial state, our controller can generate near-optimal control signals for initial states drawn from a large portion of the state space. Compared to recent model-based reinforcement learning algorithms, we show that our method is more sample-efficient and trains an order of magnitude faster. We demonstrate our method on a number of tasks, including the control of a quadrotor with 12 state variables.
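For concreteness, here is a minimal sketch of how HJB-based training losses of this kind are commonly formed; the symbols below (running cost $c$, discount rate $\rho$, policy network $\pi_\theta$, value network $V_\phi$, learned transition model $\hat{f}$, and step size $\Delta t$) are illustrative assumptions, not notation from the paper, whose exact formulation may differ. The transition model is first fit to the offline interaction data, e.g. under a forward-Euler discretization,
\[
\hat{f} \;=\; \arg\min_{f}\; \mathbb{E}_{(x,\,u,\,x')}\big[\, \| x' - x - f(x, u)\,\Delta t \|^2 \,\big],
\]
and the value network is then trained to drive the continuous-time HJB residual to zero under the learned dynamics,
\[
\mathcal{L}_{\mathrm{HJB}}(\theta, \phi) \;=\; \mathbb{E}_{x}\Big[\big(\, \rho\, V_\phi(x) \;-\; c\big(x, \pi_\theta(x)\big) \;-\; \nabla_x V_\phi(x)^{\top} \hat{f}\big(x, \pi_\theta(x)\big) \,\big)^2\Big],
\]
while the controller $\pi_\theta$ is updated to minimize the Hamiltonian $c(x, u) + \nabla_x V_\phi(x)^{\top}\hat{f}(x, u)$ at states visited by forward-simulated rollouts. Because the expectation ranges over states sampled from a large region rather than a single trajectory, one trained network can serve many initial states, consistent with the claim above.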