One of the main challenges of multi-agent learning lies in establishing convergence of the algorithms: in general, a collection of individual, self-serving agents learning concurrently is not guaranteed to converge to a stable joint policy. This stands in stark contrast to most single-agent environments and poses a prohibitive barrier to deployment in practical applications, as it induces uncertainty about the long-term behavior of the system. In this work, we propose to apply the concept of trapping regions, known from the qualitative theory of dynamical systems, to create safety sets in the joint strategy space for decentralized learning. Once the direction of the learning dynamics has been verified on the boundary of such a set, the resulting trajectories are guaranteed not to escape it during the learning process. As a result, even though convergence of the applied algorithms remains uncertain, learning is guaranteed never to form hazardous joint strategy combinations. We introduce a binary partitioning algorithm for verifying trapping regions in systems with known learning dynamics, and a heuristic sampling algorithm for scenarios where the learning dynamics are not known. In addition, via a fixed-point argument, we show the existence of a learning equilibrium within a trapping region. We demonstrate applications to a regularized version of the Dirac Generative Adversarial Network, a four-intersection traffic control scenario run in SUMO, a state-of-the-art open-source microscopic traffic simulator, and a mathematical model of economic competition.
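To make the boundary-verification idea concrete, the following is a minimal sketch, not the paper's exact algorithm: it checks whether a candidate axis-aligned box is a trapping region for a known two-player learning dynamics by testing that the dynamics point inward on every boundary face, bisecting a face whenever the coarse sample cannot certify the sign. The dynamics function, box, and all names below are hypothetical illustrations, and sampling (rather than rigorous interval bounds) makes this a heuristic check.

```python
import numpy as np

def dynamics(x):
    # Hypothetical coupled learning dynamics for two interacting learners,
    # x = (theta, phi); stands in for the joint gradient field of a game.
    theta, phi = x
    return np.array([-theta + 0.5 * phi, -phi + 0.5 * theta])

def inward_on_face(box, axis, side, depth=0, max_depth=12, samples=5):
    """Check that the dynamics point inward on one face of the box.

    box  : list of (lo, hi) intervals per coordinate
    axis : coordinate fixed on this face
    side : 0 for the lower face (inward = positive component),
           1 for the upper face (inward = negative component)
    """
    lo = np.array([b[0] for b in box])
    hi = np.array([b[1] for b in box])
    # Coarse grid of sample points on the face.
    grids = [np.linspace(lo[i], hi[i], samples) if i != axis
             else np.array([box[axis][side]]) for i in range(len(box))]
    pts = np.stack(np.meshgrid(*grids, indexing="ij"), axis=-1).reshape(-1, len(box))
    comp = np.array([dynamics(p)[axis] for p in pts])
    margin = comp.min() if side == 0 else -comp.max()
    if margin > 0:          # strictly inward at all sampled points
        return True
    if depth >= max_depth:  # cannot certify within the depth budget
        return False
    # Bisect the face along its longest free coordinate and recurse.
    free = [i for i in range(len(box)) if i != axis]
    j = max(free, key=lambda i: hi[i] - lo[i])
    mid = 0.5 * (lo[j] + hi[j])
    left, right = list(box), list(box)
    left[j], right[j] = (lo[j], mid), (mid, hi[j])
    return (inward_on_face(left, axis, side, depth + 1, max_depth, samples) and
            inward_on_face(right, axis, side, depth + 1, max_depth, samples))

def is_trapping_region(box):
    """A box is (heuristically) trapping if every face is inward-pointing."""
    return all(inward_on_face(box, axis, side)
               for axis in range(len(box)) for side in (0, 1))

if __name__ == "__main__":
    candidate = [(-1.0, 1.0), (-1.0, 1.0)]
    print("trapping:", is_trapping_region(candidate))  # True for this example
```

With known dynamics, the sampled component could be replaced by a rigorous interval bound on each sub-face, recovering a verified binary-partitioning check; when only black-box learners are available, the sampling itself becomes the heuristic described in the abstract.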