学习控制网络连接的、动态未知的子系统,可缩放遗憾 (Scalable regret for learning to control network-coupled subsystems with unknown dynamics)

We consider the problem of controlling an unknown linear quadratic Gaussian (LQG) system consisting of multiple subsystems connected over a network. Our goal is to minimize and quantify the regret (i.e. loss in performance) of our strategy with respect to an oracle who knows the system model. Viewing the interconnected subsystems globally and directly using existing LQG learning algorithms for the global system results in a regret that increases super-linearly with the number of subsystems. Instead, we propose a new Thompson sampling based learning algorithm which exploits the structure of the underlying network. We show that the expected regret of the proposed algorithm is bounded by $\tilde{\mathcal{O}} \big( n \sqrt{T} \big)$ where $n$ is the number of subsystems, $T$ is the time horizon and the $\tilde{\mathcal{O}}(\cdot)$ notation hides logarithmic terms in $n$ and $T$. Thus, the regret scales linearly with the number of subsystems. We present numerical experiments to illustrate the salient features of the proposed algorithm.

翻译：我们考虑了控制由连接网络的多个子系统组成的未知线性二次夸斯山(LQG)系统的问题。我们的目标是最大限度地减少和量化我们对了解系统模型的神器的战略的遗憾(即性能损失)。查看全球互联的子系统,并直接使用现有的LQG全球系统学习算法,结果导致遗憾随着子系统数量的增加而超线性增加。相反,我们提出了一种新的基于汤普森抽样的学习算法,它利用了基础网络的结构。我们表明,所拟议的算法的遗憾在于$\tilde_mathcal{O\\\\ big(n\ sqrt{T}\ big)$,其中美元是子系统的数量, $T$是时间范围, $talt=othalice{O ⁇ (\\\\cdot), 美元表示隐藏了美元和$T$的对数术语。因此,所拟议的算算法的偏差比例与子系统的数目是线性。我们用数字实验来说明拟议的矩阵的特征。