In cooperative multi-agent tasks, parameter sharing among agents is a common technique for reducing the number of trainable parameters and shortening training time. Existing value factorization methods train parameter-sharing individual value networks on joint transitions, i.e., the transitions of all agents are replayed at the same frequency. Because agents differ in learning difficulty, replaying their transitions at the same frequency can leave the agents in a team unevenly trained, which limits team performance. To address this, we propose Discriminative Experience Replay (DER), which changes the minimal training sample from a multi-agent transition to a single-agent transition. DER calculates the equivalent individual reward of each single-agent transition and then divides a multi-agent transition into multiple single-agent transitions. After division, DER selects significant single-agent transitions with large TD-errors, following single-agent experience replay methods. Our method can be combined with existing value function decomposition methods. The experimental results confirm the optimization equivalence before and after division and show that our method significantly improves learning efficiency on the challenging StarCraft II micromanagement tasks and Multi-Agent MuJoCo tasks.
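To make the division-and-selection idea concrete, the following is a minimal sketch, not the authors' implementation: a hypothetical `DiscriminativeReplayBuffer` that splits each joint transition into per-agent single-agent transitions (the equivalent individual rewards are assumed to be computed externally by the value factorization method) and samples them with probability proportional to their TD-error, in the style of prioritized experience replay. All names and the API are illustrative assumptions.

```python
import numpy as np


class DiscriminativeReplayBuffer:
    """Hypothetical DER-style buffer sketch (not the paper's code).

    A joint (multi-agent) transition is divided into per-agent single-agent
    transitions, each paired with an equivalent individual reward supplied by
    the caller. Single-agent transitions with larger TD-errors are then
    replayed more often, as in single-agent prioritized experience replay.
    """

    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha      # priority exponent
        self.eps = eps          # keeps priorities strictly positive
        self.storage = []       # single-agent transitions
        self.priorities = []    # one TD-error-based priority per entry

    def add_joint_transition(self, obs, actions, next_obs, individual_rewards, dones):
        """Divide one multi-agent transition into n single-agent transitions.

        `individual_rewards[i]` is the equivalent individual reward of agent i;
        its exact computation depends on the value factorization method and is
        assumed to happen outside this sketch.
        """
        max_prio = max(self.priorities, default=1.0)
        for i in range(len(actions)):
            if len(self.storage) >= self.capacity:
                self.storage.pop(0)
                self.priorities.pop(0)
            self.storage.append(
                (obs[i], actions[i], individual_rewards[i], next_obs[i], dones[i])
            )
            # new transitions start at the current maximum priority
            self.priorities.append(max_prio)

    def sample(self, batch_size):
        """Sample single-agent transitions, favoring those with large TD-error."""
        prios = np.asarray(self.priorities) ** self.alpha
        probs = prios / prios.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        return idx, [self.storage[j] for j in idx]

    def update_priorities(self, idx, td_errors):
        """Refresh priorities with the latest absolute TD-errors."""
        for j, err in zip(idx, td_errors):
            self.priorities[j] = abs(float(err)) + self.eps
```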