We study cooperative online learning in stochastic and adversarial Markov decision processes (MDPs). That is, in each episode, $m$ agents interact with an MDP simultaneously and share information in order to minimize their individual regret. We consider environments with two types of randomness: \emph{fresh} -- where each agent's trajectory is sampled i.i.d., and \emph{non-fresh} -- where the realization is shared by all agents (but each agent's trajectory is also affected by its own actions). More precisely, with non-fresh randomness the realization of every cost and transition is fixed at the start of each episode, and agents that take the same action in the same state at the same time observe the same cost and next state. We thoroughly analyze all relevant settings, highlight the challenges and differences between the models, and prove nearly-matching upper and lower regret bounds. To our knowledge, we are the first to consider cooperative reinforcement learning (RL) with either non-fresh randomness or adversarial MDPs.
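To make the distinction between the two randomness models concrete, the following sketch (purely illustrative and not part of the paper; the small tabular MDP and names such as \texttt{step\_fresh} and \texttt{step\_non\_fresh} are hypothetical) simulates one episode with $m$ agents under each model.
\begin{verbatim}
import numpy as np

# Illustrative sketch: fresh vs. non-fresh randomness in a small tabular MDP.
# All names and dimensions here are hypothetical.
rng = np.random.default_rng(0)
n_states, n_actions, horizon, m = 4, 2, 5, 3

# Transition kernel P[s, a] is a distribution over next states;
# c[s, a] in [0, 1] is the mean of a Bernoulli cost.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
c = rng.uniform(size=(n_states, n_actions))

def step_fresh(s, a):
    # Fresh randomness: each agent's transition and cost noise is drawn i.i.d.
    s_next = int(rng.choice(n_states, p=P[s, a]))
    cost = float(rng.uniform() < c[s, a])
    return s_next, cost

# Non-fresh randomness: one noise realization per (time, state, action) is
# fixed at the start of the episode and shared by all agents.
trans_u = rng.uniform(size=(horizon, n_states, n_actions))
cost_u = rng.uniform(size=(horizon, n_states, n_actions))

def step_non_fresh(h, s, a):
    # Agents taking the same action in the same state at time h observe the
    # same cost and next state, since the underlying noise is shared.
    s_next = min(int(np.searchsorted(np.cumsum(P[s, a]), trans_u[h, s, a])),
                 n_states - 1)
    cost = float(cost_u[h, s, a] < c[s, a])
    return s_next, cost

# One episode with m agents following arbitrary (here: uniformly random) policies.
states = np.zeros(m, dtype=int)
for h in range(horizon):
    for i in range(m):
        a = int(rng.integers(n_actions))
        states[i], cost = step_non_fresh(h, states[i], a)  # or step_fresh(states[i], a)
\end{verbatim}
In the fresh model, two agents in the same state taking the same action at the same time may still observe different costs and next states, whereas in the non-fresh model their observations necessarily coincide.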