Multi-Agent Advisor Q-Learning

From arXiv: This paper has been accepted to the Journal of Artificial Intelligence Research (JAIR). Please refer to https://jair.org/index.php/jair/article/view/13445 for the JAIR version. The new version on arXiv contains some formatting changes to be consistent with the JAIR version.

In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL) but there are still numerous challenges, such as high sample complexity and slow convergence to stable policies, that need to be overcome before wide-spread deployment is possible. However, many real-world environments already, in practice, deploy sub-optimal or heuristic approaches for generating policies. An interesting question that arises is how to best use such approaches as advisors to help improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online sub-optimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning based algorithms: ADMIRAL - Decision Making (ADMIRAL-DM) and ADMIRAL - Advisor Evaluation (ADMIRAL-AE), which allow us to improve learning by appropriately incorporating advice from an advisor (ADMIRAL-DM), and evaluate the effectiveness of an advisor (ADMIRAL-AE). We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms: can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors.
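To make the idea of "incorporating action recommendations from an advisor" concrete, here is a minimal sketch in Python. It is not the paper's ADMIRAL-DM algorithm: it is a single-agent, tabular toy on a five-state chain, and it assumes that the learner follows the advisor with a probability that decays over episodes. Every name in it (`choose_action`, `q_update`, `train`, `eps_advisor`, `eps_random`), the chain environment, and the advisor that is right 80% of the time are hypothetical choices made for illustration only.

```python
import random


def choose_action(q_row, advisor_action, eps_advisor, eps_random, rng):
    """Return the advisor's action, a random action, or the greedy action.

    Assumption (illustrative): follow the advisor with probability
    eps_advisor, explore uniformly with probability eps_random,
    and otherwise act greedily on the current Q-values.
    """
    u = rng.random()
    if u < eps_advisor:
        return advisor_action
    if u < eps_advisor + eps_random:
        return rng.randrange(len(q_row))
    # Greedy choice; ties resolve to the lowest action index.
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Standard one-step tabular Q-learning update, done in place."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def train(episodes=300, seed=0, n_states=5):
    """Train on a toy chain: action 1 moves right, action 0 moves left.

    The only reward (1.0) is given on reaching the right-most state.
    The hypothetical advisor is sub-optimal: it recommends "right"
    80% of the time and "left" otherwise.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    eps_advisor = 0.5  # initial reliance on the advisor (assumed value)
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap per episode
            advice = 1 if rng.random() < 0.8 else 0
            action = choose_action(q[state], advice, eps_advisor, 0.1, rng)
            if action == 1:
                next_state = min(state + 1, goal)
            else:
                next_state = max(state - 1, 0)
            reward = 1.0 if next_state == goal else 0.0
            q_update(q, state, action, reward, next_state)
            state = next_state
            if state == goal:
                break
        eps_advisor *= 0.99  # rely less on the advisor as learning proceeds
    return q


if __name__ == "__main__":
    table = train()
    for s, row in enumerate(table):
        print(s, [round(v, 3) for v in row])
```

Decaying `eps_advisor` is one simple way to let early advice guide exploration while leaving the final policy to the learned Q-values, which is also why a poor advisor would matter less over time in this toy. The multi-agent, general-sum setting that the abstract describes would need joint actions and other agents' policies, which this sketch leaves out.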
