In many real-world multi-agent cooperative tasks, due to high cost and risk, agents cannot interact with the environment or collect experiences during learning, but instead have to learn from offline datasets. However, the transition probabilities calculated from the dataset can differ greatly from the transition probabilities induced by the learned policies of the other agents, creating large errors in value estimates. Moreover, the experience distributions of the agents' datasets may vary wildly due to their diverse behavior policies, causing large differences in value estimates between agents. Consequently, agents will learn uncoordinated, suboptimal policies. In this paper, we propose MABCQ, which exploits value deviation and transition normalization to modify the transition probabilities. Value deviation optimistically increases the transition probabilities of high-value next states, and transition normalization normalizes the biased transition probabilities of next states. Together, they encourage agents to discover potentially optimal and coordinated policies. Mathematically, we prove the convergence of Q-learning under such non-stationary transition probabilities after modification. Empirically, we show that MABCQ greatly outperforms baselines and reduces the difference in value estimates between agents.
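To make the two modifications concrete, the sketch below shows one plausible tabular instantiation of re-weighting the empirical next-state distribution when computing a Q-learning target. The specific forms used here (a uniform re-normalization over observed next states and an exponential value-deviation factor) are illustrative assumptions for a minimal example, not the exact formulas of MABCQ.

```python
import numpy as np

def modified_target(next_states, counts, V, reward, gamma=0.99):
    """Illustrative Q-learning target for one (state, action) pair, built from
    the next states observed in the offline dataset.

    next_states : list of next-state ids observed after (s, a)
    counts      : empirical visit counts of each next state in the dataset
    V           : dict mapping state id -> current value estimate
    """
    # Empirical transition probabilities estimated from the offline dataset.
    p = np.asarray(counts, dtype=float)
    p /= p.sum()

    # Transition normalization (assumed form): treat every observed next
    # state equally instead of trusting the biased empirical frequencies.
    p_norm = np.full(len(next_states), 1.0 / len(next_states))

    # Value deviation (assumed form): optimistically up-weight next states
    # whose value exceeds the expected value under the empirical distribution.
    v = np.array([V[s] for s in next_states])
    deviation = np.exp(v - np.dot(p, v))  # > 1 for above-average next states

    # Combine the two modifications and renormalize into a valid distribution.
    w = p_norm * deviation
    w /= w.sum()

    return reward + gamma * np.dot(w, v)
```

Under this sketch, the modified distribution is non-stationary because it depends on the current value estimates, which is why a convergence argument for Q-learning under the modified transition probabilities is needed.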