This work theoretically studies a ubiquitous reinforcement learning policy for controlling the canonical model of continuous-time stochastic linear-quadratic systems. We show that the randomized certainty equivalent policy addresses the exploration-exploitation dilemma for linear control systems that evolve according to unknown stochastic differential equations and incur quadratic operating costs. More precisely, we establish regret bounds that grow as the square root of time, indicating that the randomized certainty equivalent policy quickly learns optimal control actions from a single state trajectory. Further, we show that the regret scales linearly with the number of unknown parameters. The presented analysis introduces novel and useful technical approaches and sheds light on fundamental challenges of continuous-time reinforcement learning.
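To make the policy concrete, the following is a minimal sketch (in Python, using NumPy/SciPy) of randomized certainty equivalence on a simulated linear SDE: estimate the drift and input matrices by least squares from the single trajectory, perturb the estimate with a decaying randomization, and apply the feedback gain obtained by solving the continuous-time Riccati equation for the perturbed estimate. The system matrices, dither magnitude, episode schedule, and perturbation scale are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

rng = np.random.default_rng(0)

# Hypothetical true dynamics dX_t = (A X_t + B U_t) dt + dW_t,
# unknown to the learner; quadratic cost x'Qx + u'Ru.
A_true = np.array([[0.0, 1.0], [-0.5, -0.3]])
B_true = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
dt = 1e-2

def ce_gain(A_hat, B_hat):
    """Certainty-equivalent gain: treat the estimate as the truth and
    solve the continuous-time algebraic Riccati equation."""
    P = solve_continuous_are(A_hat, B_hat, Q, R)
    return np.linalg.solve(R, B_hat.T @ P)      # K = R^{-1} B' P

x = np.zeros(2)
Z, dX = [], []                                  # regressors [x; u] and state increments
K = np.zeros((1, 2))                            # initial episode: pure exploration noise

for episode in range(10):
    # Run one episode of the current (randomized) CE policy on the true SDE.
    for _ in range(2_000):
        u = -K @ x + 0.5 * rng.standard_normal(1)   # small dither for excitation (assumed)
        dw = np.sqrt(dt) * rng.standard_normal(2)   # Brownian increment
        dx = (A_true @ x + B_true @ u) * dt + dw    # Euler-Maruyama step
        Z.append(np.concatenate([x, u])); dX.append(dx)
        x = x + dx
    # Least-squares estimate of [A, B] from the single trajectory so far.
    Zm, dXm = np.asarray(Z), np.asarray(dX)
    theta, *_ = np.linalg.lstsq(Zm * dt, dXm, rcond=None)
    A_hat, B_hat = theta.T[:, :2], theta.T[:, 2:]
    # Randomization: perturb the estimate before computing the CE gain,
    # shrinking the perturbation as data accrues (this schedule is an assumption).
    scale = 1.0 / np.sqrt(len(Z) * dt)
    A_rnd = A_hat + scale * rng.standard_normal(A_hat.shape)
    B_rnd = B_hat + scale * rng.standard_normal(B_hat.shape)
    try:
        K = ce_gain(A_rnd, B_rnd)               # the random draw may not be stabilizable
    except np.linalg.LinAlgError:
        pass                                    # keep the previous gain, collect more data

print("estimated A:\n", A_hat, "\nestimated B:\n", B_hat, "\nfinal gain K:", K)
```

In this sketch the exploration needed for consistent estimation comes from the randomized perturbation of the estimates (plus a small input dither), while the Riccati feedback exploits the current estimate; the square-root-of-time regret discussed above reflects how fast such a scheme can close the gap to the optimal controller.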