We study online control of time-varying linear systems with unknown dynamics in the nonstochastic control model. At a high level, we demonstrate that this setting is \emph{qualitatively harder} than that of either unknown time-invariant or known time-varying dynamics, and we complement our negative results with algorithmic upper bounds in regimes where sublinear regret is possible. More specifically, we study regret bounds with respect to common classes of policies: Disturbance Action (SLS), Disturbance Response (Youla), and linear feedback policies. While these three classes are essentially equivalent for LTI systems, we demonstrate that these equivalences break down for time-varying systems. We prove a lower bound showing that no algorithm can obtain sublinear regret with respect to the first two classes unless a certain measure of system variability also scales sublinearly in the horizon. Furthermore, we show that offline planning over state linear feedback policies is NP-hard, suggesting hardness of the online learning problem. On the positive side, we give an efficient algorithm that attains a sublinear regret bound against the class of Disturbance Response policies up to the aforementioned system variability term. In fact, our algorithm enjoys sublinear \emph{adaptive} regret bounds, a strictly stronger metric than standard regret that is more appropriate for time-varying systems. We sketch extensions to Disturbance Action policies and partial observation, and propose an inefficient algorithm for regret against linear state feedback policies.
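For concreteness, the Disturbance Action parameterization named above is standardly written as a linear map of past disturbances; the following is a sketch of the usual definition, where the memory length $m$ and parameter matrices $M^{[i]}$ are notational conventions from the nonstochastic control literature rather than quantities defined in this abstract:

```latex
% Disturbance Action (DAC) policy with memory m: the control is a
% fixed linear function of the m most recent observed disturbances.
u_t \;=\; \sum_{i=1}^{m} M^{[i]}\, w_{t-i},
\qquad
w_t \;=\; x_{t+1} - A_t x_t - B_t u_t,
```

where $(A_t, B_t)$ are the (possibly time-varying) system matrices and $w_t$ is the nonstochastic disturbance recovered from observed states; Disturbance Response policies are defined analogously, with past observation-based signals in place of the disturbances.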