In this chapter, the regulation of an Unmanned Aerial Vehicle (UAV) communication network is investigated in the presence of dynamic changes in the UAV lineup and the user distribution. We target an optimal UAV control policy that is capable of identifying an upcoming change in the UAV lineup (quit or join-in) or in the user distribution, and of proactively relocating the UAVs ahead of the change rather than passively dispatching them after it. Specifically, a deep reinforcement learning (DRL)-based UAV control framework is developed that maximizes the accumulated user satisfaction (US) score over a given time horizon while handling changes in both the UAV lineup and the user distribution. Through a deliberate state-transition design, the framework accommodates the different dimensions of the state-action space before and after a UAV lineup change. In addition, to handle the continuous state and action spaces, the deep deterministic policy gradient (DDPG) algorithm, an actor-critic DRL method, is exploited. Furthermore, to promote learning exploration around the timing of the change, the original DDPG is adapted into an asynchronous parallel computing (APC) structure, which yields better training performance in both the critic and actor networks. Finally, extensive simulations are conducted to validate the convergence of the proposed learning approach, to demonstrate its capability in jointly handling the dynamics of the UAV lineup and the user distribution, and to show its superiority over a passive-reaction method.
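To make the actor-critic machinery concrete, the following is a minimal PyTorch sketch of a DDPG learner; it is not the chapter's actual implementation. The state/action dimensions, network widths, and the `ddpg_step` helper are illustrative assumptions: in this setting the state would encode UAV positions and user-distribution features, and the action a continuous relocation command per UAV.

```python
import torch
import torch.nn as nn

# Illustrative placeholder dimensions (assumptions, not the chapter's values):
# state = UAV positions + user-distribution features, action = per-UAV velocity.
STATE_DIM, ACTION_DIM, ACTION_BOUND = 32, 8, 1.0

class Actor(nn.Module):
    """Deterministic policy mu(s): maps a state to a bounded continuous action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh())  # Tanh bounds the output

    def forward(self, s):
        return ACTION_BOUND * self.net(s)

class Critic(nn.Module):
    """Action-value function Q(s, a) that criticizes the actor's choices."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def soft_update(target, source, tau=0.005):
    """Polyak averaging of target-network parameters, as in standard DDPG."""
    for t, p in zip(target.parameters(), source.parameters()):
        t.data.mul_(1 - tau).add_(tau * p.data)

def ddpg_step(batch, actor, critic, actor_t, critic_t, opt_a, opt_c, gamma=0.99):
    """One DDPG update from a replay batch (s, a, r, s2), each of shape [B, .]."""
    s, a, r, s2 = batch
    with torch.no_grad():
        y = r + gamma * critic_t(s2, actor_t(s2))   # bootstrapped TD target
    critic_loss = nn.functional.mse_loss(critic(s, a), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    actor_loss = -critic(s, actor(s)).mean()        # deterministic policy gradient
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    soft_update(critic_t, critic); soft_update(actor_t, actor)
```

The APC adaptation described above could, roughly, be realized by running several such update loops asynchronously and pooling their experience around the timing of a lineup change; that orchestration is omitted from this sketch.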