Multi-agent reinforcement learning (MARL) is a prevalent learning paradigm for solving stochastic games. In most MARL studies, agents in a game are defined as teammates or enemies beforehand, and the relationships among the agents remain fixed throughout the game. In real-world problems, however, the relationships among agents are commonly unknown in advance or change dynamically. Many multi-party interactions start off by asking: who is on my team? This question arises whether it is one's first day at the stock exchange or in kindergarten. Training policies for such situations, in the face of imperfect information and ambiguous identities, is therefore an important problem that needs to be addressed. In this work, we develop a novel identity detection reinforcement learning (IDRL) framework that allows an agent to dynamically infer the identities of nearby agents and select an appropriate policy to accomplish the task. In the IDRL framework, a relation network is constructed to deduce the identities of other agents by observing their behaviors, and a danger network is optimized to estimate the risk of false-positive identifications. Beyond that, we propose an intrinsic reward that balances the need to maximize external rewards against the need for accurate identification. After identifying the cooperation-competition pattern among the agents, IDRL applies one of the off-the-shelf MARL methods to learn the policy. To evaluate the proposed method, we conduct experiments on the Red-10 card-shedding game, and the results show that IDRL achieves superior performance over other state-of-the-art MARL methods. Impressively, the relation network identifies the identities of agents on par with top human players, and the danger network reasonably avoids the risk of imperfect identification. The code to reproduce all the reported results is available online at https://github.com/MR-BENjie/IDRL.
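To make the roles of the framework's components concrete, the sketch below shows one minimal way a relation network, a danger network, and the intrinsic reward could fit together. All details here (the GRU encoder, layer sizes, the shaping coefficient `beta`, and the function names) are illustrative assumptions, not the paper's actual architecture; the authoritative implementation is in the repository linked above.

```python
import torch
import torch.nn as nn


class RelationNetwork(nn.Module):
    """Infers the identity (e.g. teammate vs. opponent) of a nearby agent
    from its observed behavior sequence. Hypothetical architecture: a GRU
    encoder over the behavior history followed by a linear classifier."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64, n_identities: int = 2):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_identities)

    def forward(self, behavior_seq: torch.Tensor) -> torch.Tensor:
        # behavior_seq: (batch, time, obs_dim) observed behavior of one agent
        _, h = self.encoder(behavior_seq)
        return torch.softmax(self.classifier(h[-1]), dim=-1)  # identity probabilities


class DangerNetwork(nn.Module):
    """Estimates the risk that the current identification is a false positive."""

    def __init__(self, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # estimated risk in [0, 1]


def intrinsic_reward(external_reward, identity_probs, risk, beta: float = 0.1):
    """Hypothetical shaping term: reward confident identification,
    discounted by the danger network's estimated risk."""
    confidence = identity_probs.max(dim=-1).values
    return external_reward + beta * (1.0 - risk.squeeze(-1)) * confidence


# Example usage with dummy data (shapes are illustrative only)
obs_dim = 16
relation_net = RelationNetwork(obs_dim)
danger_net = DangerNetwork(obs_dim)

behavior = torch.randn(4, 10, obs_dim)  # 4 agents, 10 observed timesteps each
state = torch.randn(4, obs_dim)
ext_reward = torch.zeros(4)

identity_probs = relation_net(behavior)
risk = danger_net(state)
shaped_reward = intrinsic_reward(ext_reward, identity_probs, risk)
```

In this reading, the shaped reward is what the downstream off-the-shelf MARL method would optimize once the cooperation-competition pattern has been inferred; the trade-off between external reward and identification accuracy is controlled by the assumed coefficient `beta`.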