We consider the problem where $M$ agents interact with $M$ identical and independent environments with $S$ states and $A$ actions using reinforcement learning for $T$ rounds. The agents share their data with a central server to minimize their regret. We aim to find an algorithm that allows the agents to minimize the regret while communicating infrequently. We propose \NAM, which runs at each agent, and prove that the total cumulative regret of the $M$ agents is upper bounded as $\Tilde{O}(DS\sqrt{MAT})$ for a Markov Decision Process with diameter $D$, $S$ states, and $A$ actions. The agents synchronize after their visits to any state-action pair exceed a certain threshold. Using this, we obtain a bound of $O\left(MSA\log(MT)\right)$ on the total number of communication rounds. Finally, we evaluate the algorithm in multiple environments and demonstrate that the proposed algorithm performs on par with an always-communicating version of the UCRL2 algorithm, while requiring significantly less communication.
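For context, assuming each agent were instead to run UCRL2 independently with its standard $\Tilde{O}(DS\sqrt{AT})$ regret guarantee, data sharing yields a $\sqrt{M}$ improvement in the total regret:
\[
\underbrace{M \cdot \Tilde{O}\!\left(DS\sqrt{AT}\right)}_{\text{independent agents}} \;=\; \Tilde{O}\!\left(DS\,M\sqrt{AT}\right)
\quad \text{versus} \quad
\underbrace{\Tilde{O}\!\left(DS\sqrt{MAT}\right)}_{\text{with data sharing}}.
\]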