Training a multi-agent reinforcement learning (MARL) algorithm is more challenging than training a single-agent reinforcement learning algorithm, because the outcome of a multi-agent task strongly depends on the complex interactions among agents and their interactions with a stochastic and dynamic environment. We propose an algorithm that boosts MARL training using the biased action information of other agents based on a friend-or-foe concept. In a mixed cooperative-competitive environment, there are generally two groups of agents from the perspective of each agent: cooperative agents and competitive agents. In the proposed algorithm, each agent updates its value function using its own action and the biased action information of the other agents in these two groups. The biased joint action of the cooperative agents is computed as the sum of their actual joint action and an imaginary cooperative joint action, obtained by assuming that all cooperative agents jointly maximize the target agent's value function. The biased joint action of the competitive agents is computed analogously. Each agent then updates its own value function using this biased action information, resulting in a biased value function and a corresponding biased policy. Consequently, the biased policy of each agent is inevitably inclined to recommend actions that cooperate with the cooperative agents and compete against the competitive agents, thereby inducing more active interactions among agents and enhancing MARL policy learning. We empirically demonstrate that our algorithm outperforms existing algorithms in various mixed cooperative-competitive environments. Furthermore, the introduced biases gradually decrease as training proceeds, and the correction based on the imaginary assumption eventually vanishes.
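To make the biased joint action concrete, the following is a minimal sketch under assumed continuous actions; the notation (target agent i, cooperative group F, competitive group E, joint actions a_F and a_E, value function Q_i) is illustrative rather than the paper's own, and treating the competitive group as a minimization is an assumption inferred from the friend-or-foe description rather than a statement from the abstract:

  \tilde{a}_F = a_F + \hat{a}_F, \quad \text{where} \quad \hat{a}_F = \arg\max_{a_F'} Q_i(s, a_i, a_F', a_E),
  \tilde{a}_E = a_E + \hat{a}_E, \quad \text{where} \quad \hat{a}_E = \arg\min_{a_E'} Q_i(s, a_i, a_F, a_E').

Agent i's value function is then updated with the biased joint action, i.e., on Q_i(s, a_i, \tilde{a}_F, \tilde{a}_E) rather than Q_i(s, a_i, a_F, a_E), which yields the biased value function and biased policy described above. In practice the arg max and arg min would presumably be approximated rather than solved exactly; this sketch only illustrates the sum of the actual and imaginary joint actions.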