Multi-agent reinforcement learning (MARL) problems are challenging due to information asymmetry. To overcome this challenge, existing methods often require a high level of coordination or communication between the agents. We consider two-agent multi-armed bandits (MABs) and Markov decision processes (MDPs) with a hierarchical information structure arising in applications, which we exploit to propose simpler and more efficient algorithms that require no coordination or communication. In this structure, in each step the ``leader'' chooses her action first, and then the ``follower'' decides his action after observing the leader's action. The two agents observe the same reward (and the same state transition in the MDP setting), which depends on their joint action. For the bandit setting, we propose a hierarchical bandit algorithm that achieves a near-optimal gap-independent regret of $\widetilde{\mathcal{O}}(\sqrt{ABT})$ and a near-optimal gap-dependent regret of $\mathcal{O}(\log(T))$, where $A$ and $B$ are the numbers of actions of the leader and the follower, respectively, and $T$ is the number of steps. We further extend these results to the case of multiple followers and to the case of a deep hierarchy, and obtain near-optimal regret bounds in both cases. For the MDP setting, we obtain a regret of $\widetilde{\mathcal{O}}(\sqrt{H^7S^2ABT})$, where $H$ is the number of steps per episode, $S$ is the number of states, and $T$ is the number of episodes. This matches the existing lower bound in terms of $A$, $B$, and $T$.
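To make the hierarchical interaction protocol concrete, the following is a minimal sketch in Python, assuming UCB1-style index policies for both agents (the leader runs one learner over her $A$ actions; the follower runs a separate learner over his $B$ actions for each observed leader action). This is an illustrative sketch of the information structure only, not the paper's exact algorithm or analysis; the reward table in the usage example is hypothetical.

```python
import math
import random

class UCB1:
    """Standard UCB1 index policy over a fixed finite action set (illustrative)."""
    def __init__(self, n_actions):
        self.counts = [0] * n_actions
        self.means = [0.0] * n_actions
        self.t = 0

    def select(self):
        self.t += 1
        for a, c in enumerate(self.counts):
            if c == 0:                      # play each action once before using indices
                return a
        indices = [m + math.sqrt(2 * math.log(self.t) / c)
                   for m, c in zip(self.means, self.counts)]
        return max(range(len(indices)), key=lambda a: indices[a])

    def update(self, a, reward):
        self.counts[a] += 1
        self.means[a] += (reward - self.means[a]) / self.counts[a]

def run_hierarchical_bandit(reward_fn, A, B, T):
    """Hierarchical protocol: in each step the leader picks a in [A] first, the
    follower observes a and picks b in [B], and both observe the same reward r(a, b)."""
    leader = UCB1(A)
    followers = [UCB1(B) for _ in range(A)]  # one follower learner per leader action (assumption)
    for _ in range(T):
        a = leader.select()
        b = followers[a].select()            # follower conditions on the observed leader action
        r = reward_fn(a, b)                  # common reward for the joint action
        leader.update(a, r)
        followers[a].update(b, r)

# Usage example with a hypothetical Bernoulli reward table, A = 3 and B = 4.
means = [[0.2, 0.5, 0.3, 0.1],
         [0.4, 0.6, 0.7, 0.2],
         [0.1, 0.3, 0.2, 0.5]]
run_hierarchical_bandit(lambda a, b: float(random.random() < means[a][b]), A=3, B=4, T=10000)
```

Note that no coordination or communication is needed beyond the follower observing the leader's action, which is exactly the information structure assumed in the abstract.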