The exploration/exploitation trade-off is an inherent challenge in data-driven adaptive control. Although this trade-off has been studied for multi-armed bandits (MABs) and reinforcement learning for linear systems, it is less well understood for learning-based control of nonlinear systems. A significant theoretical challenge in the nonlinear setting is that there is no explicit characterization of an optimal controller for a given set of cost and system parameters. We propose using a finite-horizon oracle controller with full knowledge of the parameters as a reasonable surrogate for the optimal controller. This allows us to develop policies in the context of learning-based MPC and MABs and to conduct a control-theoretic analysis, using techniques from MPC and optimization theory, showing that these policies achieve low regret with respect to this finite-horizon oracle. Our simulations exhibit the low regret of our policy on a heating, ventilation, and air-conditioning (HVAC) model with a partially unknown cost function.
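To make the oracle-based regret notion concrete, the following is a minimal toy sketch (not the paper's actual policy or model): regret at horizon T is the cumulative gap between the stage costs incurred by a learning policy and those of a finite-horizon oracle that knows the true parameters. The cost model below, with an exploration overhead decaying like 1/sqrt(t), is an illustrative assumption, chosen only to show how sublinear regret looks numerically.

```python
import numpy as np

def simulate_costs(horizon: int) -> tuple[np.ndarray, np.ndarray]:
    """Toy stage costs for a learning policy vs. a finite-horizon oracle.

    The oracle knows the true cost/system parameters, so its stage cost
    serves as the baseline. The learning policy pays an extra exploration
    cost that shrinks as parameters are identified; here that overhead is
    modeled (as an illustrative assumption) as O(1/sqrt(t)).
    """
    t = np.arange(1, horizon + 1)
    oracle_cost = np.ones(horizon)                 # oracle's per-step cost
    policy_cost = oracle_cost + 1.0 / np.sqrt(t)   # decaying exploration overhead
    return policy_cost, oracle_cost

def cumulative_regret(policy_cost: np.ndarray, oracle_cost: np.ndarray) -> np.ndarray:
    """Regret_T = sum over t <= T of (policy stage cost - oracle stage cost)."""
    return np.cumsum(policy_cost - oracle_cost)

T = 10_000
policy, oracle = simulate_costs(T)
regret = cumulative_regret(policy, oracle)
# With a 1/sqrt(t) overhead, cumulative regret grows like O(sqrt(T)),
# so average regret regret[T-1] / T vanishes as T grows ("low regret").
```

The key property of "low regret" is that the cumulative gap grows sublinearly in T, so the learning policy's time-averaged cost converges to the oracle's.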