An Opponent-Aware Reinforcement Learning Method for Team-to-Team Multi-Vehicle Pursuit via Maximizing Mutual Information Indicator

The pursuit-evasion game in smart cities has a profound impact on the Multi-vehicle Pursuit (MVP) problem, in which police cars cooperatively pursue suspected vehicles. Existing studies on the MVP problem tend to make evading vehicles move randomly or along a fixed, prescribed route. Opponent modeling has shown considerable promise in tackling the non-stationarity caused by adversary agents. However, most opponent modeling methods focus on two-player competitive games and simple scenarios without environmental interference. This paper considers a Team-to-Team Multi-vehicle Pursuit (T2TMVP) problem in a complicated urban traffic scene where the evading vehicles adopt pre-trained dynamic strategies to make decisions intelligently. To solve this problem, we propose an opponent-aware reinforcement learning via maximizing mutual information indicator (OARLM2I2) method to improve pursuit efficiency in complicated environments. First, a sequential encoding-based opponents joint strategy modeling (SEOJSM) mechanism is proposed to generate the evading vehicles' joint strategy model, which assists the multi-agent decision-making process based on deep Q-network (DQN). Then, we design a mutual information-united loss, which simultaneously considers the reward fed back from the environment and the effectiveness of the opponents' joint strategy model, to update the pursuing vehicles' decision-making process. Extensive experiments based on SUMO demonstrate that our method outperforms other baselines by 21.48% on average in reducing pursuit time. The code is available at \url{https://github.com/ANT-ITS/OARLM2I2}.
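
The idea of a loss that combines the environment reward with the quality of the opponent model can be illustrated with a minimal sketch. This is not the formulation used in OARLM2I2, which the abstract does not specify. It assumes a standard one-step DQN temporal-difference term and, as a stand-in for the mutual information term, the negative log-likelihood of the opponents' observed joint action under the predicted strategy (maximizing this log-likelihood maximizes a standard variational lower bound on mutual information, up to a constant entropy term). The function names, the weight `beta`, and the discount `gamma` are all hypothetical.

```python
import math


def td_loss(q_values, action, reward, next_q_values, gamma=0.99, done=False):
    """Squared one-step TD error for one transition (standard DQN target)."""
    target = reward if done else reward + gamma * max(next_q_values)
    return (q_values[action] - target) ** 2


def opponent_nll(pred_probs, observed_action):
    """Negative log-likelihood of the opponents' observed joint action
    under the predicted opponent-strategy distribution (assumed stand-in
    for the mutual information term)."""
    eps = 1e-12  # guards against log(0)
    return -math.log(pred_probs[observed_action] + eps)


def united_loss(q_values, action, reward, next_q_values,
                pred_probs, observed_opp_action,
                beta=0.5, gamma=0.99, done=False):
    """Hypothetical combined loss: TD term plus a beta-weighted
    opponent-model term. A better opponent model lowers the loss."""
    return (td_loss(q_values, action, reward, next_q_values, gamma, done)
            + beta * opponent_nll(pred_probs, observed_opp_action))


if __name__ == "__main__":
    # One pursuer transition with two actions and two opponent joint actions.
    loss = united_loss(
        q_values=[1.0, 2.0], action=1, reward=1.0, next_q_values=[0.0, 0.0],
        pred_probs=[0.5, 0.5], observed_opp_action=0, beta=2.0, gamma=0.9,
    )
    print(f"united loss: {loss:.4f}")
```

In this sketch, `beta` trades off fitting the Q-values against predicting the evaders' behaviour; in a real implementation both terms would be averaged over a replay batch and differentiated by an autograd framework.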
