We consider the problem of nonstochastic control with a sequence of quadratic losses, i.e., LQR control. We provide an efficient online algorithm that achieves an optimal dynamic (policy) regret of $\tilde{O}(\max\{n^{1/3} \mathcal{TV}(M_{1:n})^{2/3}, 1\})$, where $\mathcal{TV}(M_{1:n})$ is the total variation of any oracle sequence of Disturbance Action policies parameterized by $M_1, \dots, M_n$, chosen in hindsight to cater to unknown nonstationarity. This rate improves on the best known rate of $\tilde{O}(\sqrt{n (\mathcal{TV}(M_{1:n})+1)})$ for general convex losses, and we prove that it is information-theoretically optimal for LQR. The main technical components include the reduction of LQR to online linear regression with delayed feedback due to Foster and Simchowitz (2020), as well as a new proper learning algorithm with an optimal $\tilde{O}(n^{1/3})$ dynamic regret on a family of ``minibatched'' quadratic losses, which could be of independent interest.
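For context, the comparator class can be sketched as follows, following the standard disturbance-action parameterization from the nonstochastic control literature (in the style of Agarwal et al., 2019); the memory length $m$, the choice of matrix norm, and any composition with a fixed stabilizing controller are conventions that may differ in the paper itself. A Disturbance Action policy with parameter $M = (M^{[1]}, \dots, M^{[m]})$ plays a control that is linear in past disturbances, and the total variation of a comparator sequence measures how much the oracle policy drifts over time:
\[
u_t = \sum_{i=1}^{m} M^{[i]} w_{t-i}, \qquad \mathcal{TV}(M_{1:n}) = \sum_{t=2}^{n} \| M_t - M_{t-1} \|.
\]
Under this reading, a stationary comparator has $\mathcal{TV}(M_{1:n}) = 0$ and the bound recovers $\tilde{O}(1)$ dynamic regret, while larger drift degrades the rate as $n^{1/3} \mathcal{TV}(M_{1:n})^{2/3}$.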