In this paper, we present a reinforcement learning approach to designing a control policy for a "leader" agent that herds a swarm of "follower" agents, via repulsive interactions, as quickly as possible to a target probability distribution over a strongly connected graph. The leader control policy is a function of the swarm distribution, which evolves over time according to a mean-field model in the form of an ordinary difference equation. The dependence of the policy on agent populations at each graph vertex, rather than on individual agent activity, simplifies the observations required by the leader and enables the control strategy to scale with the number of agents. Two Temporal-Difference learning algorithms, SARSA and Q-Learning, are used to generate the leader control policy based on the follower agent distribution and the leader's location on the graph. A simulation environment corresponding to a grid graph with 4 vertices was used to train and validate the control policies for follower agent populations ranging from 10 to 100. Finally, the control policies trained on 100 simulated agents were used to successfully redistribute a physical swarm of 10 small robots to a target distribution among 4 spatial regions.
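To make the learning setup concrete, the sketch below illustrates one way the tabular Q-Learning variant could be organized: the leader's state is its graph vertex together with a binned version of the follower distribution over a 4-vertex grid graph, its actions are moves along graph edges (or staying put), and the reward penalizes distance from the target distribution. The repulsion dynamics, repulsion rate, target distribution, binning resolution, and hyperparameters here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Assumed 2x2 grid graph on vertices 0..3 (edges 0-1, 0-2, 1-3, 2-3).
NEIGHBORS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
N_AGENTS = 100                              # follower swarm size
TARGET = np.array([0.1, 0.2, 0.3, 0.4])     # assumed target distribution

def step(leader, counts, rng):
    """Repel a fraction of followers at the leader's vertex to random neighbors (assumed dynamics)."""
    counts = counts.copy()
    movers = rng.binomial(counts[leader], 0.5)   # assumed repulsion rate
    counts[leader] -= movers
    for _ in range(movers):
        counts[rng.choice(NEIGHBORS[leader])] += 1
    return counts

def discretize(counts, bins=5):
    """Bin the empirical distribution so the leader observes populations, not individual agents."""
    return tuple((counts * bins) // (N_AGENTS + 1))

def reward(counts):
    """Negative L1 distance between the swarm distribution and the target."""
    return -np.abs(counts / N_AGENTS - TARGET).sum()

def q_learning(episodes=2000, horizon=50, alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = {}
    for _ in range(episodes):
        counts = rng.multinomial(N_AGENTS, [0.25] * 4)   # start from a uniform swarm
        leader = int(rng.integers(4))
        for _ in range(horizon):
            s = (leader, discretize(counts))
            actions = NEIGHBORS[leader] + [leader]        # move along an edge or stay
            qs = Q.setdefault(s, {a: 0.0 for a in actions})
            a = int(rng.choice(actions)) if rng.random() < eps else max(qs, key=qs.get)
            counts = step(a, counts, rng)
            r = reward(counts)
            s2 = (a, discretize(counts))
            q2 = Q.setdefault(s2, {b: 0.0 for b in NEIGHBORS[a] + [a]})
            # Q-Learning update; SARSA would instead bootstrap on the next action actually taken.
            qs[a] += alpha * (r + gamma * max(q2.values()) - qs[a])
            leader = a
    return Q

if __name__ == "__main__":
    Q = q_learning()
    print(f"learned Q-table over {len(Q)} observed states")
```

Because the observation is the population at each vertex rather than per-agent state, the same table (and the same learned policy) can be reused as the swarm size grows, which is the scaling property the abstract highlights.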