We consider online reinforcement learning in Mean-Field Games. In contrast to existing work, we alleviate the need for a mean-field oracle by developing an algorithm that estimates the mean-field and the optimal policy using a single sample path of the generic agent. We call this Sandbox Learning, as it can be used as a warm-start for any agent operating in a multi-agent non-cooperative setting. We adopt a two-timescale approach in which an online fixed-point recursion for the mean-field operates on a slower timescale, in tandem with a control-policy update for the generic agent on a faster timescale. Under a sufficient exploration condition, we provide finite-sample guarantees on the convergence of the mean-field and the control policy to the mean-field equilibrium. The sample complexity of the Sandbox Learning algorithm is $\mathcal{O}(\epsilon^{-4})$. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in a congestion game.
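A minimal sketch of the two-timescale structure described above, for a tabular setting: a generic agent follows a single sample path, updating a Q-function on the fast timescale and a mean-field (state-distribution) estimate on the slow timescale. The environment callbacks `step` and `reward`, the step-size exponents, the discount factor, and the softmax exploration rule are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def sandbox_learning(n_states, n_actions, step, reward,
                     T=100_000, gamma=0.95, beta=5.0, seed=0):
    """Hypothetical two-timescale loop along a single sample path.

    `step(s, a, mu, rng)` and `reward(s, a, mu)` are user-supplied callbacks
    standing in for the (unknown) mean-field-dependent dynamics and reward.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # fast-timescale iterate (policy)
    mu = np.full(n_states, 1.0 / n_states)    # slow-timescale mean-field estimate
    s = rng.integers(n_states)
    for t in range(1, T + 1):
        alpha = 1.0 / t ** 0.6                # fast step size
        eta = 1.0 / t ** 0.9                  # slow step size (decays faster)
        # softmax exploration so every (state, action) pair keeps being visited
        p = np.exp(beta * Q[s]); p /= p.sum()
        a = rng.choice(n_actions, p=p)
        r = reward(s, a, mu)                  # reward evaluated at current mean-field estimate
        s_next = step(s, a, mu, rng)          # next state along the single sample path
        # fast timescale: Q-learning update given the current mean-field estimate
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        # slow timescale: online fixed-point recursion toward the induced state distribution
        e = np.zeros(n_states); e[s_next] = 1.0
        mu += eta * (e - mu)
        s = s_next
    return Q, mu
```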