Multi-agent reinforcement learning (MARL) is increasingly used in safety-critical applications that require guaranteed safety (e.g., no unsafe state is ever visited) during the learning process. Unfortunately, current MARL methods do not provide safety guarantees. We therefore present two shielding approaches for safe MARL. In centralized shielding, we synthesize a single shield that monitors the joint action of all agents and corrects any unsafe action when necessary. In factored shielding, we synthesize multiple shields based on a factorization of the joint state space observed by all agents; the shields monitor the agents concurrently, and each shield is responsible for only a subset of agents at each step. Experimental results show that both approaches guarantee the safety of agents during learning without compromising the quality of the learned policies; moreover, factored shielding scales better with the number of agents than centralized shielding.
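To make the idea of a centralized shield concrete, the following is a minimal sketch, not the synthesis procedure used in this work. It assumes a hypothetical one-dimensional corridor world in which a joint state is unsafe when two agents share a cell, and it checks safety with a one-step lookahead only. All names (`step`, `is_safe`, `centralized_shield`, `CORRIDOR_LENGTH`) are illustrative assumptions.

```python
from itertools import product

# Hypothetical toy world (an assumption for illustration): agents stand on
# integer cells of a corridor and each picks a move in {-1, 0, +1}.
MOVES = (-1, 0, 1)
CORRIDOR_LENGTH = 5


def step(positions, joint_action):
    """Apply a joint action, clamping every agent to the corridor."""
    return tuple(
        min(max(p + a, 0), CORRIDOR_LENGTH - 1)
        for p, a in zip(positions, joint_action)
    )


def is_safe(positions):
    """A joint state is safe when no two agents occupy the same cell."""
    return len(set(positions)) == len(positions)


def centralized_shield(positions, proposed):
    """Pass `proposed` through if its successor state is safe; otherwise
    substitute a safe joint action that changes as few agents as possible."""
    if is_safe(step(positions, proposed)):
        return proposed
    # Enumerate every joint action and keep those with a safe successor.
    candidates = [
        joint_action
        for joint_action in product(MOVES, repeat=len(positions))
        if is_safe(step(positions, joint_action))
    ]
    if not candidates:
        raise RuntimeError("no safe joint action available")
    # Prefer the candidate that overrides the fewest individual actions.
    return min(
        candidates,
        key=lambda ja: sum(a != b for a, b in zip(ja, proposed)),
    )
```

For example, with agents at cells 1 and 3, the proposed joint action `(1, -1)` would put both agents on cell 2, so the shield replaces it with a safe joint action that overrides only one agent. The enumeration over all joint actions grows exponentially with the number of agents in this sketch, which gives some intuition for why a single shield over the joint action space becomes costly as agents are added.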