Multi-UAV Path Planning for Wireless Data Harvesting with Deep Reinforcement Learning

Source: arXiv. Code available at https://github.com/hbayerlein/uav_data_harvesting. Submitted to the IEEE JSAC special issue on UAV Communications in 5G and Beyond Networks. arXiv admin note: text overlap with arXiv:2007.00544.

Harvesting data from distributed Internet of Things (IoT) devices with multiple autonomous unmanned aerial vehicles (UAVs) is a challenging problem requiring flexible path planning methods. We propose a multi-agent reinforcement learning (MARL) approach that, in contrast to previous work, can adapt to profound changes in the scenario parameters defining the data harvesting mission, such as the number of deployed UAVs, number and position of IoT devices, or the maximum flying time, without the need to perform expensive recomputations or relearn control policies. We formulate the path planning problem for a cooperative, non-communicating, and homogeneous team of UAVs tasked with maximizing collected data from distributed IoT sensor nodes subject to flying time and collision avoidance constraints. The path planning problem is translated into a decentralized partially observable Markov decision process (Dec-POMDP), which we solve by training a double deep Q-network (DDQN) to approximate the optimal UAV control policy. By exploiting global-local maps of the environment that are fed into convolutional layers of the agents, we show that our proposed network architecture enables the agents to cooperate effectively by carefully dividing the data collection task among themselves, adapt to large state spaces, and make movement decisions that balance data collection goals, flight-time efficiency, and navigation constraints.
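
The two technical ingredients named in the abstract, the double deep Q-network (DDQN) and the global-local maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). The function names, the discount factor, the map sizes, and the choice of a central crop for the local map and average pooling for the global map are all assumptions made for illustration; the abstract only states that global-local maps are fed into convolutional layers.

```python
import numpy as np


def ddqn_targets(rewards, dones, q_online_next, q_target_next, gamma=0.95):
    """Double DQN regression targets for a batch of transitions.

    q_online_next and q_target_next hold the Q-values of the next state,
    shape (batch, num_actions), from the online and the target network.
    gamma is a hypothetical discount factor, not a value from the paper.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)
    q_online_next = np.asarray(q_online_next, dtype=np.float64)
    q_target_next = np.asarray(q_target_next, dtype=np.float64)

    # The online network selects the greedy next action ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates it. Decoupling selection from
    # evaluation is what distinguishes double DQN from plain DQN.
    next_values = q_target_next[np.arange(len(rewards)), best_actions]

    # Terminal transitions do not bootstrap from the next state.
    return rewards + gamma * (~dones) * next_values


def global_local_maps(centered_map, local_size=5, global_scale=3):
    """Split an agent-centered map of shape (H, W, C) into two inputs.

    Assumed scheme: the local map is a full-resolution central crop and
    the global map is the whole map compressed by average pooling.
    """
    h, w, c = centered_map.shape

    # Local map: crop a local_size x local_size window around the center.
    top = (h - local_size) // 2
    left = (w - local_size) // 2
    local_map = centered_map[top:top + local_size, left:left + local_size]

    # Global map: average over non-overlapping global_scale x global_scale
    # windows, trimming edge cells that do not fill a complete window.
    hh = h - h % global_scale
    ww = w - w % global_scale
    trimmed = centered_map[:hh, :ww]
    global_map = trimmed.reshape(
        hh // global_scale, global_scale, ww // global_scale, global_scale, c
    ).mean(axis=(1, 3))

    return global_map, local_map
```

In a setup like this, each UAV would compute both maps from its own observation and pass them through separate convolutional branches. The compressed global map keeps the network input small on large environments, while the local crop preserves the detail needed for collision avoidance.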
