In this paper, we study the Multi-Start Team Orienteering Problem (MSTOP), a mission re-planning problem in which vehicles are initially located away from the depot and carry different amounts of fuel. We consider the goal of the multiple vehicles to be maximizing the total collected profit under resource (e.g., time, fuel) consumption constraints. Such re-planning problems arise in a wide range of intelligent UAS applications, where changes in the mission environment force the operation of multiple vehicles to deviate from the original plan. To solve this problem with deep reinforcement learning (RL), we develop a policy network with self-attention on each partial tour and encoder-decoder attention between the partial tour and the remaining nodes. We propose a modified REINFORCE algorithm in which the greedy rollout baseline is replaced by a local mini-batch baseline computed from multiple, possibly non-duplicate sample rollouts. By drawing multiple samples per training instance, we can learn faster and obtain a stable policy gradient estimator with significantly fewer instances. The proposed training algorithm outperforms the conventional greedy rollout baseline, even when combined with the maximum entropy objective.
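To make the baseline concrete, the policy gradient under this scheme can be sketched as follows; this is an illustrative form, assuming the local baseline is simply the mean return over the $B$ sample rollouts drawn for the same instance $s$ (the exact estimator used in the paper may differ):
\[
\nabla_{\theta} J(\theta) \approx \frac{1}{B}\sum_{i=1}^{B}\bigl(R(\tau_i) - b(s)\bigr)\,\nabla_{\theta}\log p_{\theta}(\tau_i \mid s), \qquad b(s) = \frac{1}{B}\sum_{j=1}^{B} R(\tau_j),
\]
where $\tau_1,\dots,\tau_B$ are tours sampled from the policy $p_{\theta}$ for instance $s$ and $R(\tau)$ is the total profit collected by tour $\tau$.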