Current deep reinforcement learning (RL) algorithms are still highly task-specific and lack the ability to generalize to new environments. Lifelong learning (LLL), however, aims to solve multiple tasks sequentially by efficiently transferring and reusing knowledge between tasks. Despite a surge of interest in lifelong RL in recent years, the lack of a realistic testbed makes robust evaluation of LLL algorithms difficult. Multi-agent RL (MARL), on the other hand, can be seen as a natural scenario for lifelong RL owing to its inherent non-stationarity: the agents' policies change over time. In this work, we introduce a multi-agent lifelong learning testbed that supports both zero-shot and few-shot settings. Our setup is based on Hanabi, a partially observable, fully cooperative multi-agent game that has been shown to be challenging for zero-shot coordination. Its large strategy space makes it a desirable environment for lifelong RL tasks. We evaluate several recent MARL methods and benchmark state-of-the-art LLL algorithms in limited memory and computation regimes to shed light on their strengths and weaknesses. This continual learning paradigm also provides us with a pragmatic way of going beyond centralized training, which is the most commonly used training protocol in MARL. We empirically show that agents trained in our setup are able to coordinate well with unseen agents, without the additional assumptions made by previous works.
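To make the evaluation protocol concrete, the sketch below illustrates one plausible reading of the setup: a learner is trained sequentially against a pool of partner policies (the lifelong tasks) and is then evaluated zero-shot against held-out partners. This is a minimal sketch only; the names `LifelongLearner`, `PartnerPolicy`, and the training/evaluation stubs are hypothetical placeholders, and a real implementation would plug in a Hanabi environment and trained MARL agents.

```python
# Minimal sketch (not the paper's exact code) of sequential training with a pool
# of partner policies followed by zero-shot evaluation on held-out partners.
# All class and function names here are hypothetical placeholders.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PartnerPolicy:
    name: str
    # Maps an observation to an action; stubbed out for this sketch.
    act: Callable[[object], int] = lambda obs: 0


@dataclass
class LifelongLearner:
    replay_budget: int                      # limited-memory regime: max stored transitions
    seen_partners: List[str] = field(default_factory=list)

    def train_with(self, partner: PartnerPolicy, steps: int) -> None:
        # Placeholder for training with one partner (one lifelong "task"),
        # e.g. a value-based MARL method combined with an LLL strategy such as
        # regularization or replay under the replay_budget constraint.
        self.seen_partners.append(partner.name)

    def evaluate_with(self, partner: PartnerPolicy, episodes: int) -> float:
        # Placeholder: would play `episodes` Hanabi games with this partner
        # and return the mean score; no gradient updates in zero-shot evaluation.
        return 0.0


def lifelong_protocol(learner: LifelongLearner,
                      train_pool: List[PartnerPolicy],
                      heldout_pool: List[PartnerPolicy]) -> Dict[str, float]:
    # 1) Tasks arrive one at a time: train sequentially with each seen partner.
    for partner in train_pool:
        learner.train_with(partner, steps=100_000)
    # 2) Zero-shot cross-play: coordinate with unseen partners, no further training.
    return {p.name: learner.evaluate_with(p, episodes=100) for p in heldout_pool}
```

A few-shot variant of the same protocol would simply allow a small number of adaptation steps with each held-out partner before scoring, which is one way to read the "zero-shot and few-shot settings" mentioned above.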