We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon $T$ with fixed and known cost matrices $Q, R$ but unknown and non-stationary dynamics $\{A_t, B_t\}$. The sequence of dynamics matrices may be arbitrary, subject only to its total variation $V_T$ being $o(T)$; $V_T$ is unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal, controllers is available for all $t$, we present an algorithm that achieves the optimal dynamic regret of $\tilde{\mathcal{O}}\left(V_T^{2/5}T^{3/5}\right)$. With piecewise constant dynamics, our algorithm achieves the optimal regret of $\tilde{\mathcal{O}}(\sqrt{ST})$, where $S$ is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual multi-armed bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or sliding-window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter being estimated is non-stationary. Our analysis also highlights the key motif driving the regret: the LQR problem is, in spirit, a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and we therefore believe our results should find wider application.
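To make the estimation problem concrete, the following is a minimal sketch of OLS estimation of $(A_t, B_t)$ from a trajectory, restricted to the most recent transitions. It is not the algorithm of this work, which detects non-stationarity adaptively; a static window is exactly the kind of non-adaptive forgetting argued above to be potentially sub-optimal. The function names `simulate` and `windowed_ols`, the purely random exploratory inputs, and the Gaussian noise model are all assumptions made for illustration.

```python
import numpy as np


def simulate(A_seq, B_seq, noise_std, rng):
    """Roll out x_{t+1} = A_t x_t + B_t u_t + w_t from x_0 = 0.

    Inputs u_t are i.i.d. standard normal (an illustrative assumption;
    no stabilizing controller is applied in this sketch).
    """
    T = len(A_seq)
    n, m = B_seq[0].shape
    xs = np.zeros((T + 1, n))
    us = rng.normal(size=(T, m))
    for t in range(T):
        w = noise_std * rng.normal(size=n)
        xs[t + 1] = A_seq[t] @ xs[t] + B_seq[t] @ us[t] + w
    return xs, us


def windowed_ols(xs, us, window):
    """Least-squares estimate of (A, B) from the last `window` transitions.

    xs has shape (T + 1, n) and us has shape (T, m).
    """
    T = us.shape[0]
    n = xs.shape[1]
    start = max(0, T - window)
    Z = np.hstack([xs[start:T], us[start:T]])  # regressors z_t = [x_t, u_t]
    Y = xs[start + 1:T + 1]                    # targets x_{t+1}
    # Solve Z @ Theta ~= Y in the least-squares sense; Theta stacks A^T over B^T.
    Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return Theta[:n].T, Theta[n:].T
```

When the dynamics switch partway through the trajectory, calling `windowed_ols` with a window that covers the whole history mixes data from both regimes and yields a biased estimate, whereas a window that fits inside the latest regime does not. Choosing that window without knowing when the switch happened is the difficulty an adaptive detection strategy is meant to address.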