Unmanned aerial vehicles (UAVs) are seen as a promising technology for performing a wide range of tasks in wireless communication networks. In this work, we consider the deployment of a group of UAVs to collect the data generated by IoT devices. Specifically, we focus on the case where the collected data is time-sensitive, and it is critical to maintain its timeliness. Our objective is to optimally design the UAVs' trajectories and the subsets of visited IoT devices such that the global Age-of-Updates (AoU) is minimized. To this end, we formulate the studied problem as a mixed-integer nonlinear program (MINLP) under time and quality-of-service constraints. To efficiently solve the resulting optimization problem, we investigate the cooperative Multi-Agent Reinforcement Learning (MARL) framework and propose an approach based on the popular on-policy Reinforcement Learning (RL) algorithm Proximal Policy Optimization (PPO), namely Multi-Agent PPO (MAPPO). Our approach leverages the centralized training with decentralized execution (CTDE) framework, in which the UAVs learn their optimal policies while training a centralized value function. Our simulation results show that the proposed MAPPO approach reduces the global AoU by at least half compared to conventional off-policy reinforcement learning approaches.
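To make the CTDE idea concrete, the following is a minimal sketch of a centralized-critic PPO (MAPPO-style) update: each UAV's actor acts on its own local observation, while a single value function is trained on the joint observation. All module names, dimensions, and hyper-parameters here are illustrative assumptions and do not reflect the paper's actual implementation.

# Minimal MAPPO-style sketch under the CTDE paradigm (illustrative only).
import torch
import torch.nn as nn

N_UAVS, OBS_DIM, ACT_DIM = 3, 16, 5   # assumed sizes, not from the paper
CLIP_EPS = 0.2                        # standard PPO clipping parameter

class Actor(nn.Module):
    """Decentralized policy: each UAV acts on its own local observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, ACT_DIM))
    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class CentralCritic(nn.Module):
    """Centralized value function: trained on the joint (global) observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(N_UAVS * OBS_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, joint_obs):
        return self.net(joint_obs).squeeze(-1)

actors = [Actor() for _ in range(N_UAVS)]
critic = CentralCritic()
opt = torch.optim.Adam([p for a in actors for p in a.parameters()]
                       + list(critic.parameters()), lr=3e-4)

def ppo_update(obs, actions, old_log_probs, returns):
    """obs: [batch, N_UAVS, OBS_DIM]; actions, old_log_probs, returns: [batch, ...]."""
    values = critic(obs.reshape(obs.shape[0], -1))
    advantages = (returns - values).detach()
    policy_loss = 0.0
    for i, actor in enumerate(actors):
        dist = actor(obs[:, i])
        ratio = torch.exp(dist.log_prob(actions[:, i]) - old_log_probs[:, i])
        clipped = torch.clamp(ratio, 1 - CLIP_EPS, 1 + CLIP_EPS)
        policy_loss += -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    loss = policy_loss + 0.5 * value_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

In this sketch the clipped-ratio objective is the standard PPO surrogate; the only multi-agent ingredient is that the critic sees the concatenated observations of all UAVs while each actor remains decentralized, which is the essence of the CTDE framework referred to in the abstract.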