Deep Reinforcement Learning (DRL) is gaining attention as a potential approach to designing trajectories for autonomous unmanned aerial vehicles (UAVs) used as flying access points in the context of cellular or Internet of Things (IoT) connectivity. DRL solutions offer the advantage of on-the-go learning, and hence rely on very little prior contextual information. A corresponding drawback, however, lies in the need for many learning episodes, which severely restricts the applicability of such approaches in real-world time- and energy-constrained missions. Here, we propose a model-aided deep Q-learning approach that, in contrast to previous work, considerably reduces the need for extensive training data samples, while still achieving the overarching goal of DRL, i.e., to guide a battery-limited UAV along an efficient data-harvesting trajectory, without prior knowledge of the wireless channel characteristics and with only limited knowledge of the wireless node locations. The key idea is to use a small subset of nodes as anchors (i.e., nodes with known locations) and to learn a model of the propagation environment while implicitly estimating the positions of the regular nodes. Interaction with this model allows us to train a deep Q-network (DQN) that approximates the optimal UAV control policy. We show that, in comparison with standard DRL approaches, the proposed model-aided approach requires at least one order of magnitude fewer training data samples to reach identical data collection performance, hence offering a first step towards making DRL a viable solution to the problem.
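To make the model-aided idea concrete, the sketch below shows the general shape of training a DQN by interacting with a learned environment model rather than the real world. This is a minimal illustration only, not the paper's implementation: the class and function names (`LearnedEnvironmentModel`, `DQN`, `train`) are hypothetical, and the propagation/node-position model is stubbed out as a fixed grid-world surrogate so the loop is runnable.

```python
# Minimal, hypothetical sketch of model-aided deep Q-learning (PyTorch).
# The environment model is a placeholder; in the paper's approach it would
# be fitted from a few real measurements using anchor nodes with known
# locations, while regular node positions are implicitly estimated.
import random
from collections import deque

import torch
import torch.nn as nn
import torch.optim as optim


class DQN(nn.Module):
    """Small MLP mapping a UAV state (position, battery) to Q-values over moves."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, x):
        return self.net(x)


class LearnedEnvironmentModel:
    """Stand-in for the learned propagation/node-position model (hypothetical)."""

    def __init__(self, grid: int = 10, battery: int = 50):
        self.grid, self.battery0 = grid, battery
        self.nodes = [(2, 7), (8, 3)]  # hypothetical estimated node positions

    def reset(self):
        self.pos, self.battery = [0, 0], self.battery0
        return self._state()

    def _state(self):
        return torch.tensor(
            [self.pos[0] / self.grid, self.pos[1] / self.grid,
             self.battery / self.battery0], dtype=torch.float32)

    def step(self, action: int):
        dx, dy = [(0, 1), (0, -1), (1, 0), (-1, 0)][action]
        self.pos[0] = min(max(self.pos[0] + dx, 0), self.grid - 1)
        self.pos[1] = min(max(self.pos[1] + dy, 0), self.grid - 1)
        self.battery -= 1  # battery-limited mission: episode ends at zero
        # Surrogate "data collected": decays with distance to the nearest node.
        d = min(abs(self.pos[0] - x) + abs(self.pos[1] - y) for x, y in self.nodes)
        return self._state(), 1.0 / (1.0 + d), self.battery <= 0


def train(episodes: int = 200, gamma: float = 0.95):
    """Standard DQN loop, but all interaction happens with the learned model."""
    env, dqn, target = LearnedEnvironmentModel(), DQN(3, 4), DQN(3, 4)
    target.load_state_dict(dqn.state_dict())
    buffer, opt, eps = deque(maxlen=10_000), optim.Adam(dqn.parameters(), lr=1e-3), 1.0
    for ep in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection over the four moves.
            a = random.randrange(4) if random.random() < eps \
                else int(dqn(s).argmax())
            s2, r, done = env.step(a)
            buffer.append((s, a, r, s2, done))
            s = s2
            if len(buffer) >= 64:
                ss, aa, rr, ss2, dd = zip(*random.sample(buffer, 64))
                ss, ss2 = torch.stack(ss), torch.stack(ss2)
                aa = torch.tensor(aa)
                rr = torch.tensor(rr, dtype=torch.float32)
                dd = torch.tensor(dd, dtype=torch.float32)
                q = dqn(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
                with torch.no_grad():  # bootstrapped target from the target net
                    y = rr + gamma * (1 - dd) * target(ss2).max(1).values
                loss = nn.functional.mse_loss(q, y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        eps = max(0.05, eps * 0.98)  # decay exploration
        if ep % 20 == 0:
            target.load_state_dict(dqn.state_dict())
    return dqn
```

The DQN training loop itself is standard; the sample-efficiency gain reported in the abstract comes from replacing costly real flights with rollouts in the learned model, which here is only a fixed surrogate.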