Thompson Sampling (TS) is an efficient method for decision-making under uncertainty, in which an action is sampled from a carefully prescribed distribution that is updated based on the observed data. In this work, we study the problem of adaptive control of stabilizable linear-quadratic regulators (LQRs) using TS, where the system dynamics are unknown. Previous works have established that $\tilde O(\sqrt{T})$ frequentist regret is optimal for the adaptive control of LQRs. However, the existing methods either work only in restrictive settings, require a priori known stabilizing controllers, or rely on computationally intractable approaches. We propose an efficient TS algorithm for the adaptive control of LQRs, TS-based Adaptive Control (TSAC), that attains $\tilde O(\sqrt{T})$ regret even for multidimensional systems, thereby solving the open problem posed in Abeille and Lazaric (2018). TSAC does not require an a priori known stabilizing controller and achieves fast stabilization of the underlying system by effectively exploring the environment in the early stages. Our result hinges on developing a novel lower bound on the probability that TS provides an optimistic sample. By carefully prescribing an early exploration strategy and a policy update rule, we show that TS achieves order-optimal regret in adaptive control of multidimensional stabilizable LQRs. We empirically demonstrate the performance and the efficiency of TSAC in several adaptive control tasks.
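To make the sample-then-control loop concrete, the following is a minimal toy sketch of Thompson Sampling on a scalar LQR, $x_{t+1} = a x_t + b u_t + w_t$ with unknown $(a, b)$. It is not the TSAC algorithm: the abstract does not specify the exploration strategy or the policy update rule, so everything here is an assumption made for illustration. That includes the function names `riccati_gain` and `thompson_lqr`, the Gaussian posterior $\mathcal{N}(\hat\theta, \sigma^2 V^{-1})$ built from regularized least squares, the resampling at every step, the unit-variance exploration noise for the first 30 steps, and all numerical values.

```python
import math
import random


def riccati_gain(a, b, q=1.0, r=1.0, iters=500):
    """Gain k (control u = -k * x) for the scalar model x' = a*x + b*u."""
    p = q
    for _ in range(iters):
        # Scalar Riccati recursion, written in a form that avoids cancellation.
        p = q + a * a * p * r / (r + b * b * p)
    return a * b * p / (r + b * b * p)


def thompson_lqr(a_true=1.2, b_true=0.8, horizon=300, explore_steps=30,
                 noise_std=0.1, lam=1.0, seed=0):
    """Toy TS loop (hypothetical setup, not TSAC). Returns (a_hat, b_hat, k)."""
    rng = random.Random(seed)
    # Regularized least-squares statistics for theta = (a, b):
    # V = lam * I + sum z z^T and s = sum z * x_next, with z = (x, u).
    v11, v12, v22 = lam, 0.0, lam
    s1, s2 = 0.0, 0.0
    x, k = 0.0, 0.0
    a_hat = b_hat = 0.0
    for t in range(horizon):
        det = v11 * v22 - v12 * v12
        # Point estimate theta_hat = V^{-1} s.
        a_hat = (v22 * s1 - v12 * s2) / det
        b_hat = (v11 * s2 - v12 * s1) / det
        # Cholesky factor of C = V^{-1}, used to draw a correlated sample.
        c11, c12, c22 = v22 / det, -v12 / det, v11 / det
        l11 = math.sqrt(c11)
        l21 = c12 / l11
        l22 = math.sqrt(max(c22 - l21 * l21, 0.0))
        g1, g2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Sampled model: theta ~ N(theta_hat, noise_std^2 * V^{-1}).
        a_s = a_hat + noise_std * l11 * g1
        b_s = b_hat + noise_std * (l21 * g1 + l22 * g2)
        # Act optimally for the sampled model.
        k = riccati_gain(a_s, b_s)
        u = -k * x
        if t < explore_steps:
            # Assumed early-exploration scheme: additive Gaussian input noise.
            u += rng.gauss(0.0, 1.0)
        x_next = a_true * x + b_true * u + rng.gauss(0.0, noise_std)
        # Update the statistics with the observed transition.
        v11 += x * x
        v12 += x * u
        v22 += u * u
        s1 += x * x_next
        s2 += u * x_next
        x = x_next
    return a_hat, b_hat, k


if __name__ == "__main__":
    a_hat, b_hat, k = thompson_lqr()
    print("estimates:", a_hat, b_hat)
    print("closed-loop pole under last sampled gain:", 1.2 - 0.8 * k)
```

The open-loop system in this sketch is unstable ($|a| > 1$) and no stabilizing controller is supplied, which mirrors the setting described above in miniature. The closed loop is stable when $|a - b k| < 1$, which can be checked against the returned gain. The regret guarantees discussed in the abstract do not transfer to this simplified scheme.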