Multi-agent reinforcement learning typically suffers from sample inefficiency: learning suitable policies requires a large number of data samples. Learning from external demonstrators is one way to mitigate this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. Our approach leverages a two-level Q-learning architecture and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by evaluating the advisors at each state and then using them to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms outperform baselines, effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.
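To make the two-level idea concrete, the following is a minimal single-agent sketch in Python of learning from multiple advisors, written only to illustrate the mechanism described above. The class name `AdvisorQLearner`, the tabular interface (`n_states`, `n_actions`), and the advisors-as-callables convention are assumptions for illustration, not the paper's implementation; the paper's algorithms extend this kind of scheme to the multi-agent setting with convergence and sample complexity guarantees.

```python
# Minimal sketch: a high-level Q-function scores knowledge sources (the agent
# itself plus each advisor) per state, and a low-level Q-function learns the
# task. Bad advisors accumulate low high-level value and are eventually ignored.
import numpy as np

class AdvisorQLearner:
    def __init__(self, n_states, n_actions, advisors,
                 alpha=0.1, gamma=0.95, epsilon=0.1):
        self.advisors = advisors                              # callables: state -> action
        self.n_sources = len(advisors) + 1                    # advisors + the agent's own policy
        self.q_task = np.zeros((n_states, n_actions))         # low level: value of each action
        self.q_source = np.zeros((n_states, self.n_sources))  # high level: value of each source
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def _propose(self, source, state):
        # Source 0 is the agent itself; sources 1..k defer to the advisors.
        if source == 0:
            return int(np.argmax(self.q_task[state]))
        return self.advisors[source - 1](state)

    def act(self, state):
        # High-level epsilon-greedy choice over sources, then the chosen
        # source proposes the environment action.
        if np.random.rand() < self.epsilon:
            source = np.random.randint(self.n_sources)
        else:
            source = int(np.argmax(self.q_source[state]))
        return self._propose(source, state), source

    def update(self, state, action, source, reward, next_state, done):
        # Low-level update: ordinary Q-learning on the executed action.
        target = reward if done else reward + self.gamma * np.max(self.q_task[next_state])
        self.q_task[state, action] += self.alpha * (target - self.q_task[state, action])
        # High-level update: evaluate the source that was actually followed.
        src_target = reward if done else reward + self.gamma * np.max(self.q_source[next_state])
        self.q_source[state, source] += self.alpha * (src_target - self.q_source[state, source])
```

In a training loop, `act` would return both the action to execute and the source that proposed it, and `update` would be called with both after each environment step, so that advisor evaluation and task learning proceed simultaneously.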