Model-Based Reinforcement Learning for Stochastic Hybrid Systems

Optimal control of general nonlinear systems is a central challenge in automation. Enabled by powerful function approximators, data-driven approaches to control have recently tackled challenging robotic applications. However, such methods often obscure the structure of dynamics and control behind black-box, over-parameterized representations, which limits our ability to understand closed-loop behavior. This paper adopts a hybrid-system view of nonlinear modeling and control that lends an explicit hierarchical structure to the problem and breaks down complex dynamics into simpler localized units. We consider a sequence modeling paradigm that captures the temporal structure of the data and derive an expectation-maximization (EM) algorithm that automatically decomposes nonlinear dynamics into stochastic piecewise affine dynamical systems with nonlinear boundaries. Furthermore, we show that these time-series models naturally admit a closed-loop extension, which we use to extract local polynomial feedback controllers from nonlinear experts via behavioral cloning. Finally, we introduce a novel hybrid relative entropy policy search (Hb-REPS) technique that incorporates the hierarchical nature of hybrid systems and optimizes a set of time-invariant local feedback controllers derived from a local polynomial approximation of a global state-value function.
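To make the phrase "stochastic piecewise affine dynamical systems with nonlinear boundaries" concrete, the following is a minimal simulation sketch, not the paper's model or its EM algorithm. It assumes that the active mode is sampled from a softmax over quadratic features of the current state only (so the mode boundary is nonlinear in the state) and that each mode applies its own affine map plus Gaussian noise. All names (`mode_probabilities`, `simulate`, `A`, `b`, `W`, `c`) and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def mode_probabilities(x, W, c):
    """Softmax over logits built from quadratic features of x.

    Because the features include x**2, the regions where a given mode
    dominates are separated by a boundary that is nonlinear in x.
    """
    feats = np.concatenate([x, x ** 2])
    logits = W @ feats + c
    logits = logits - logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()


def simulate(x0, A, b, W, c, noise_std, steps, seed=0):
    """Roll out a switching affine system with state-dependent mode sampling."""
    rng = np.random.default_rng(seed)
    K, d = b.shape
    xs = np.empty((steps + 1, d))
    zs = np.empty(steps, dtype=int)
    xs[0] = x0
    for t in range(steps):
        # Sample the discrete mode given the current continuous state.
        p = mode_probabilities(xs[t], W, c)
        zs[t] = rng.choice(K, p=p)
        # Apply the affine dynamics of the sampled mode, plus Gaussian noise.
        noise = noise_std * rng.standard_normal(d)
        xs[t + 1] = A[zs[t]] @ xs[t] + b[zs[t]] + noise
    return xs, zs


# Hypothetical 1-D example with two modes (illustrative values only).
A = np.array([[[0.5]], [[0.9]]])          # per-mode linear maps, shape (K, d, d)
b = np.array([[0.0], [0.2]])              # per-mode offsets, shape (K, d)
W = np.array([[0.0, 4.0], [0.0, -4.0]])   # logit weights on [x, x**2]
c = np.array([-1.0, 1.0])                 # logit biases
xs, zs = simulate(np.array([1.0]), A, b, W, c, noise_std=0.05, steps=50)
```

With these values, mode 0 is more likely when |x| > 0.5 and mode 1 when |x| < 0.5, so the switching boundary is the set x^2 = 0.25. A learning procedure such as the EM algorithm described above would estimate the per-mode parameters and the boundary from data instead of fixing them by hand.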
