任意时间后悔保证的线性二次系统控制算法 (Any-Time Regret-Guaranteed Algorithm for Control of Linear Quadratic Systems)

We propose a computationally efficient algorithm that achieves anytime regret of order $\mathcal{O}(\sqrt{t})$, with explicit dependence on the system dimensions and on the solution of the Discrete Algebraic Riccati Equation (DARE). Our approach uses an appropriately tuned regularization and a sufficiently accurate initial estimate to construct confidence ellipsoids for control design. A carefully designed input-perturbation mechanism is incorporated to ensure anytime performance. We develop two variants of the algorithm. The first enforces strong sequential stability, requiring each policy to be stabilizing and successive policies to remain close. This sequential condition helps prevent state explosion at policy update times; however, it results in a suboptimal regret scaling with respect to the DARE solution. Motivated by this limitation, we introduce a second class of algorithms that removes this requirement and instead requires only that each generated policy be stabilizing. Closed-loop stability is then preserved through a dwell-time inspired policy-update rule. This class of algorithms also addresses key shortcomings of most existing approaches which lack explicit high-probability bounds on the state trajectory expressed in system-theoretic terms. Our analysis shows that partially relaxing the sequential-stability requirement yields optimal regret. Finally, our method eliminates the need for any \emph{a priori} bound on the norm of the DARE solution, an assumption required by all existing computationally efficient OFU based algorithms.

翻译：我们提出了一种计算高效的算法，该算法实现了阶为$\mathcal{O}(\sqrt{t})$的任意时间后悔，并明确依赖于系统维度和离散代数Riccati方程（DARE）的解。我们的方法使用适当调整的正则化和足够精确的初始估计来构建用于控制设计的置信椭球。算法中精心设计了一个输入扰动机制以确保任意时间性能。我们开发了该算法的两种变体。第一种强制实施强序贯稳定性，要求每个策略都是稳定的，且连续策略保持接近。这种序贯条件有助于防止策略更新时刻的状态爆炸；然而，这导致关于DARE解的后悔缩放是次优的。受此限制的启发，我们引入了第二类算法，它移除了这一要求，转而只要求每个生成的策略是稳定的。然后，通过一种受驻留时间启发的策略更新规则来保持闭环稳定性。这类算法也解决了大多数现有方法的关键缺陷，这些方法缺乏用系统理论术语表达的状态轨迹的显式高概率界。我们的分析表明，部分放宽序贯稳定性要求可产生最优后悔。最后，我们的方法消除了对DARE解范数的任何先验界的需求，这是所有现有基于计算高效的OFU算法所必需的假设。