We study finite-time horizon continuous-time linear-quadratic reinforcement learning problems in an episodic setting, where both the state and control coefficients are unknown to the controller. We first propose a least-squares algorithm based on continuous-time observations and controls, and establish a logarithmic regret bound of order $O((\ln M)(\ln\ln M))$, with $M$ being the number of learning episodes. The analysis consists of two parts: perturbation analysis, which exploits the regularity and robustness of the associated Riccati differential equation; and parameter estimation error, which relies on sub-exponential properties of continuous-time least-squares estimators. We further propose a practically implementable least-squares algorithm based on discrete-time observations and piecewise constant controls, which achieves similar logarithmic regret with an additional term depending explicitly on the time stepsizes used in the algorithm.
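The discrete-time variant of the algorithm estimates the unknown drift coefficients by least squares from sampled states under piecewise constant controls. The sketch below is only an illustration of that estimation step under assumed linear dynamics $dX_t = (AX_t + Bu_t)\,dt + dW_t$; the dimensions, coefficients, stepsize, and exploration scheme are placeholders, not the paper's actual algorithm or tuning.

```python
# Minimal sketch (illustrative, not the paper's algorithm): least-squares
# estimation of unknown drift coefficients (A, B) in dX_t = (A X_t + B u_t) dt + dW_t
# from discrete-time observations and piecewise constant controls.
# All names and values below (A_true, B_true, dt, n_steps) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, m = 2, 1                                  # state and control dimensions (assumed)
A_true = np.array([[0.0, 1.0], [-1.0, -0.5]])
B_true = np.array([[0.0], [1.0]])
dt, n_steps = 0.01, 2000                     # time stepsize and number of steps per episode

# Simulate one episode with a piecewise constant exploratory control.
X = np.zeros((n_steps + 1, d))
U = rng.normal(size=(n_steps, m))            # exploratory control, constant on each step
for k in range(n_steps):
    drift = A_true @ X[k] + B_true @ U[k]
    X[k + 1] = X[k] + drift * dt + np.sqrt(dt) * rng.normal(size=d)

# Regress increments on (state, control): (X_{k+1} - X_k)/dt ≈ A X_k + B u_k.
Z = np.hstack([X[:-1], U])                   # regressors, shape (n_steps, d + m)
Y = (X[1:] - X[:-1]) / dt                    # responses,  shape (n_steps, d)
theta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
A_hat, B_hat = theta_hat[:d].T, theta_hat[d:].T

print("A estimation error:", np.linalg.norm(A_hat - A_true))
print("B estimation error:", np.linalg.norm(B_hat - B_true))
```

In the paper's setting, the estimation error of such estimators (and its sub-exponential concentration) feeds into the perturbation analysis of the Riccati equation, which in turn controls the per-episode regret.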