We develop a learning-based control algorithm for unknown dynamical systems under severe data limitations. Specifically, the algorithm has access to streaming data only from a single, ongoing trial. Despite the scarcity of data, we show, through a series of examples, that the algorithm can provide performance comparable to reinforcement learning algorithms trained over millions of environment interactions. It accomplishes such performance by effectively leveraging various forms of side information on the dynamics to reduce the sample complexity. Such side information typically comes from elementary laws of physics and qualitative properties of the system. More precisely, the algorithm approximately solves an optimal control problem encoding the system's desired behavior. To this end, it constructs and refines a differential inclusion that contains the unknown vector field of the dynamics. The differential inclusion, used in an interval Taylor-based method, makes it possible to over-approximate the set of states the system may reach. Theoretically, we establish a bound on the suboptimality of the approximate solution with respect to the case of known dynamics. We show that the longer the trial lasts or the more side information is available, the tighter the bound becomes. Empirically, experiments in a high-fidelity F-16 aircraft simulator and MuJoCo environments such as Reacher, Swimmer, and Cheetah illustrate the algorithm's effectiveness.
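To make the reachability idea concrete, the following is a minimal sketch, not the paper's algorithm: a first-order interval Euler step that over-approximates the reachable set of x' = f(x) when side information brackets the unknown vector field inside a differential inclusion. The function names (`interval_euler_reach`, `f_bounds`) and the example dynamics (a stable scalar system with a bounded disturbance) are illustrative assumptions; a rigorous interval Taylor method would additionally enclose the discretization remainder.

```python
def interval_euler_reach(x_lo, x_hi, f_bounds, dt, steps):
    """Propagate an interval enclosure [x_lo, x_hi] of the state forward.

    f_bounds(a, b) must return (lo, hi) with lo <= f(x) <= hi for every
    x in [a, b], i.e. an interval extension of the differential inclusion.
    Each Euler step widens the enclosure so it still contains every
    trajectory compatible with the side information (up to discretization
    error, which a full interval Taylor method would also bound).
    """
    for _ in range(steps):
        lo, hi = f_bounds(x_lo, x_hi)
        x_lo, x_hi = x_lo + dt * lo, x_hi + dt * hi
    return x_lo, x_hi

# Hypothetical side information: f(x) = -x + w with an unknown
# disturbance |w| <= 0.1, so f(x) lies in [-x - 0.1, -x + 0.1].
def f_bounds(a, b):
    return -b - 0.1, -a + 0.1

# Enclose all states reachable from x(0) in [0.9, 1.1] over one second.
reach = interval_euler_reach(0.9, 1.1, f_bounds, dt=0.01, steps=100)
```

The returned interval necessarily contains the nominal solution x(1) = exp(-1) of the disturbance-free system; tighter side information (a smaller disturbance bound) shrinks the enclosure, mirroring the suboptimality bound discussed above.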