The exploration/exploitation trade-off is an inherent challenge in data-driven and adaptive control. While this trade-off has been studied for multi-armed bandits, reinforcement learning (RL) for finite Markov chains, and RL for linear control systems, it is less well studied for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that, unlike the linear case, there is no explicit characterization of an optimal controller for a given set of cost and system parameters. In this paper, we propose using a finite-horizon oracle controller with perfect knowledge of all system parameters as a reference for optimal control actions. First, this allows us to define a new notion of regret with respect to this oracle finite-horizon controller. Second, it allows us to develop learning-based policies that we prove achieve low regret (i.e., square-root regret up to a log-squared factor) with respect to this oracle. We develop these policies in the context of learning-based model predictive control (LBMPC). We conduct a statistical analysis to prove finite-sample concentration bounds for the estimation step of our policy, and then perform a control-theoretic analysis using techniques from MPC and optimization theory to show that the policy ensures closed-loop stability and achieves low regret. We conclude with numerical experiments on a model of heating, ventilation, and air-conditioning (HVAC) systems that demonstrate the low regret of our policy in a setting where the cost function is partially unknown to the controller.
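To make the regret notion concrete, the following is a minimal sketch (not the paper's algorithm or system model) of regret measured against an oracle controller that knows the true dynamics. It uses an illustrative scalar system x_{t+1} = a*x_t + u_t with cost x^2 + u^2, and a toy learner whose parameter estimate improves over time; all names and the learning model are assumptions for illustration only.

```python
def run(controller, a=0.8, x0=1.0, T=50):
    """Roll out a controller on the toy scalar system and return total cost."""
    x, total = x0, 0.0
    for t in range(T):
        u = controller(x, t)
        total += x**2 + u**2          # stage cost c(x, u) = x^2 + u^2
        x = a * x + u                  # illustrative dynamics x_{t+1} = a*x_t + u_t
    return total

a_true = 0.8

# Oracle: knows a_true exactly; here it simply cancels the dynamics in one step.
oracle = lambda x, t: -a_true * x

def learner(x, t):
    # Toy stand-in for a learning-based policy: the estimate of a_true
    # improves as more data arrives (a hypothetical learning curve).
    a_hat = a_true * (1 - 1.0 / (t + 1))
    return -a_hat * x

# Regret = cumulative cost of the learner minus that of the oracle.
regret = run(learner) - run(oracle)
print(regret > 0)  # the learner pays extra cost while its estimate is poor
```

A low-regret policy is one for which this cumulative cost gap grows slowly with the horizon T (here, square-root in T up to a log-squared factor), rather than linearly.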