Reinforcement learning (RL) is known to be data-hungry. To improve the sample efficiency of RL, it has been proposed that the learning algorithm use data from 'approximately similar' processes. However, since the process models are unknown, identifying which other processes are similar is itself a challenge. In this work, we study this problem in the benchmark Linear Quadratic Regulator (LQR) setting. Specifically, we consider a setting with multiple agents, each corresponding to a copy of a linear process to be controlled. The agents' local processes can be partitioned into clusters based on similarities in dynamics and tasks. Combining ideas from sequential elimination and zeroth-order policy optimization, we propose a new algorithm that performs clustering and learning simultaneously and outputs a personalized policy (controller) for each cluster. Under a suitable notion of cluster separation that captures differences in closed-loop performance across systems, we prove that our approach guarantees correct clustering with high probability. Furthermore, we show that the sub-optimality gap of the policy learned for each cluster scales inversely with the size of the cluster, with no additional bias, unlike in prior works on collaborative learning-based control. Our work is the first to reveal how clustering can be used in data-driven control to learn personalized policies that enjoy the statistical gains of collaboration without suffering sub-optimality due to the inclusion of data from dissimilar processes. From a distributed implementation perspective, our method is attractive because it incurs only a mild logarithmic communication overhead.
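To make the zeroth-order policy optimization ingredient concrete, the following is a minimal sketch, not the algorithm proposed in this work. It assumes the cluster membership is already known, uses noise-free finite-horizon rollouts, and relies on a standard two-point gradient estimator. The function names (`lqr_cost`, `zeroth_order_grad`, `cluster_step`) and all parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def lqr_cost(A, B, Q, R, K, x0s, horizon=50):
    """Average finite-horizon cost of the policy u = -K x over initial states x0s.

    Assumption: noise-free rollouts, used here only to keep the sketch simple.
    """
    total = 0.0
    for x0 in x0s:
        x = np.array(x0, dtype=float)
        for _ in range(horizon):
            u = -K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u
    return total / len(x0s)


def zeroth_order_grad(cost_fn, K, radius, num_samples, rng):
    """Two-point gradient estimate built from cost evaluations only."""
    d = K.size
    grad = np.zeros_like(K, dtype=float)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)  # random direction with unit Frobenius norm
        diff = cost_fn(K + radius * U) - cost_fn(K - radius * U)
        grad += (d / (2.0 * radius)) * diff * U
    return grad / num_samples


def cluster_step(cost_fns, K, step_size, radius, num_samples, rng):
    """One policy update that averages the estimates of all agents in a cluster.

    Assumption: every entry of cost_fns belongs to the same (known) cluster.
    """
    grads = [zeroth_order_grad(f, K, radius, num_samples, rng) for f in cost_fns]
    return K - step_size * np.mean(grads, axis=0)
```

In this sketch each agent only queries its own closed-loop cost, so no model of `A` or `B` is needed by the update itself. Averaging the per-agent estimates reduces the variance introduced by the random perturbation directions, which loosely mirrors the collaboration gain described above. Deciding which agents belong in `cost_fns` is the clustering problem and is not addressed here.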