Intelligent agents and multi-agent systems play important roles in applications such as the control of drone swarms, where multi-agent navigation and obstacle avoidance serve as the foundation for more advanced functionality. In multi-agent navigation and obstacle-avoidance tasks, the decision-making interactions among agents and the dynamics of the environment become difficult for traditional path-planning algorithms and reinforcement learning algorithms to handle as environmental complexity increases. The classical multi-agent reinforcement learning algorithm, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), addressed its predecessors' problems of non-stationary training and inability to cope with environmental randomness. However, MADDPG ignores the temporal information hidden in the agents' interactions with the environment. Moreover, because its centralized-training decentralized-execution (CTDE) scheme requires each agent's critic network to operate on the actions of all agents together with the full environment state, it scales poorly to larger numbers of agents. To address MADDPG's neglect of the temporal information in the data, this article proposes a new algorithm, MADDPG-LSTMactor, which combines MADDPG with Long Short-Term Memory (LSTM). By taking each agent's observations over consecutive timesteps as the input of its policy network, the LSTM layer can extract the hidden temporal information. Experimental results demonstrate that this algorithm performs better in scenarios with a small number of agents. Furthermore, to overcome MADDPG's inefficiency in scenarios with many agents, this article proposes a lightweight MADDPG (MADDPG-L) algorithm, which simplifies the input of the critic network. Experiments show that MADDPG-L outperforms MADDPG when the number of agents is large.
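To make the two architectural changes concrete, below is a minimal PyTorch sketch of the network modifications the abstract describes: an actor whose input is a window of K consecutive observations processed by an LSTM layer (MADDPG-LSTMactor), and a centralized critic with a reduced input (MADDPG-L). The class names, layer sizes, window length K, and the exact composition of the simplified critic input are illustrative assumptions, not details given in the abstract.

```python
# Sketch of the MADDPG-LSTMactor policy network and a lightweight critic.
# Hyperparameters and the simplified critic input are assumptions.
import torch
import torch.nn as nn

class LSTMActor(nn.Module):
    """Policy network that consumes one agent's last K observations,
    letting an LSTM layer extract the hidden temporal information."""
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, act_dim), nn.Tanh(),
        )

    def forward(self, obs_seq):
        # obs_seq: (batch, K, obs_dim), observations from K consecutive timesteps
        _, (h_n, _) = self.lstm(obs_seq)
        return self.head(h_n[-1])  # act on the final LSTM hidden state

class LightweightCritic(nn.Module):
    """Centralized critic with a simplified input. The abstract only says the
    critic input is simplified; here we assume it sees the agent's own
    observation plus the joint action, rather than every agent's full
    observation, so the input size no longer grows with the global state."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * n_agents, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, own_obs, joint_actions):
        # own_obs: (batch, obs_dim); joint_actions: (batch, n_agents * act_dim)
        return self.net(torch.cat([own_obs, joint_actions], dim=-1))
```

The intent of the sketch is to show where each modification lives: the LSTM sits in the actor, so the added temporal processing does not enlarge the critic, while the lightweight critic shrinks the centralized input that makes standard MADDPG scale poorly as the number of agents grows.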