We consider the problem of online learning in Linear Quadratic Control systems whose state transition and state-action transition matrices $A$ and $B$ may be initially unknown. We devise an online learning algorithm and provide guarantees on its expected regret. The regret at time $T$ is upper bounded (i) by $\widetilde{O}((d_u+d_x)\sqrt{d_xT})$ when $A$ and $B$ are unknown, (ii) by $\widetilde{O}(d_x^2\log(T))$ if only $A$ is unknown, and (iii) by $\widetilde{O}(d_x(d_u+d_x)\log(T))$ if only $B$ is unknown, under a mild non-degeneracy condition ($d_x$ and $d_u$ denote the dimensions of the state and of the control input, respectively). These regret scalings are minimal in $T$, $d_x$ and $d_u$, as they match existing lower bounds in scenario (i) when $d_x\le d_u$ [SF20], and in scenario (ii) [lai1986]. We conjecture that our upper bounds are also optimal in scenario (iii) (there is no known lower bound in this setting). Existing online algorithms proceed in epochs of (typically exponentially) growing durations. The control policy is fixed within each epoch, which considerably simplifies the analysis of the estimation errors on $A$ and $B$ and hence of the regret. Our algorithm departs from this design choice: it is a simple variant of certainty-equivalence regulators, in which the estimates of $A$ and $B$ and the resulting control policy can be updated as frequently as we wish, possibly at every step. Quantifying the impact of such a constantly-varying control policy on the quality of these estimates and on the regret constitutes one of the technical challenges tackled in this paper.
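The per-step certainty-equivalence scheme described above can be illustrated with a minimal sketch: at every step, the learner refits a regularized least-squares estimate of $[A\;B]$ from the observed transitions and recomputes the LQR gain from the estimated model via a Riccati solve. This is only an illustrative toy, not the paper's exact algorithm; the use of SciPy's `solve_discrete_are`, the noise scales, and the constant exploration noise are assumptions made here for the example.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ce_gain(A_hat, B_hat, Q, R):
    """Certainty-equivalence LQR gain computed from the current estimates."""
    P = solve_discrete_are(A_hat, B_hat, Q, R)
    return np.linalg.solve(R + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)

rng = np.random.default_rng(0)
d_x, d_u, T = 2, 1, 500

# True system, unknown to the learner (chosen stable for this toy example).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(d_x), np.eye(d_u)

x = np.zeros(d_x)
K = np.zeros((d_u, d_x))
# Regularized least-squares statistics for estimating theta = [A B]^T.
V = np.eye(d_x + d_u)             # Gram matrix (ridge regularization)
S = np.zeros((d_x + d_u, d_x))    # cross-moments sum z_t x_{t+1}^T

for t in range(T):
    u = -K @ x + 0.1 * rng.standard_normal(d_u)   # small exploration noise
    z = np.concatenate([x, u])
    x_next = A @ x + B @ u + 0.1 * rng.standard_normal(d_x)
    V += np.outer(z, z)
    S += np.outer(z, x_next)
    theta = np.linalg.solve(V, S)                 # estimate refit every step
    A_hat, B_hat = theta[:d_x].T, theta[d_x:].T
    try:
        K = ce_gain(A_hat, B_hat, Q, R)           # policy updated every step
    except Exception:
        pass  # keep the previous gain if the CE Riccati solve fails
    x = x_next
```

Note that, unlike epoch-based schemes, the gain `K` here changes at every step, so consecutive transitions are collected under different policies; controlling the resulting estimation error is precisely the difficulty the abstract points to.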