The multi-agent setting is intricate and unpredictable, since the behaviors of multiple agents influence one another. To address this environmental uncertainty, distributional reinforcement learning algorithms, which capture uncertainty through a distributional output, have been integrated with multi-agent reinforcement learning (MARL) methods and have achieved state-of-the-art performance. However, distributional MARL algorithms still rely on traditional $\epsilon$-greedy exploration, which does not take the cooperative strategy into account. In this paper, we present a risk-based exploration method that leads to collaboratively optimistic behavior by shifting the sampling region of the return distribution. Initially, we take the expectation over the upper quantiles of the state-action values, which corresponds to optimistic actions, for exploration, and we gradually shift the sampling region of the quantiles toward the full distribution for exploitation. By ensuring that each agent is exposed to the same level of risk, we can force them to take cooperatively optimistic actions. Our method, based on quantile regression with an appropriately controlled level of risk, shows remarkable performance in multi-agent settings that require cooperative exploration.
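To make the exploration rule concrete, the following is a minimal sketch, not the authors' implementation, of how an optimistic action could be selected from a quantile critic and how the risk level could be annealed toward the full distribution; the function names, the linear schedule, and the initial risk level are illustrative assumptions.

```python
import numpy as np

def optimistic_action(quantile_values, risk_level):
    """Pick an action by averaging only the upper quantiles of each action's
    return distribution (hypothetical helper, not the paper's exact code).

    quantile_values: array of shape (num_actions, num_quantiles), estimated
        quantiles of Z(s, a) sorted in ascending order.
    risk_level: float in [0, 1]; 0 averages the full distribution
        (risk-neutral, exploitation), values near 1 keep only the topmost
        quantiles (optimistic, exploration).
    """
    num_quantiles = quantile_values.shape[1]
    # Shift the sampling region toward the upper tail: drop quantiles
    # below the risk threshold before taking the expectation.
    start = min(int(np.floor(risk_level * num_quantiles)), num_quantiles - 1)
    optimistic_q = quantile_values[:, start:].mean(axis=1)
    return int(np.argmax(optimistic_q))

def risk_schedule(step, anneal_steps, initial_risk=0.75):
    """Assumed linear annealing: start optimistic, end at the full distribution."""
    return max(0.0, initial_risk * (1.0 - step / anneal_steps))
```

Applying the same `risk_level` to every agent at a given step is what, under this sketch, exposes all agents to an identical level of risk and encourages jointly optimistic exploration.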