We consider online reinforcement learning in Mean-Field Games (MFGs). Unlike traditional approaches, we alleviate the need for a mean-field oracle by developing an algorithm that approximates the Mean-Field Equilibrium (MFE) using a single sample path of the generic agent. We call this {\it Sandbox Learning}, as it can serve as a warm start for any agent learning in a multi-agent non-cooperative setting. We adopt a two time-scale approach in which an online fixed-point recursion for the mean-field operates on a slower time-scale, in tandem with a control policy update for the generic agent on a faster time-scale. Given that the underlying Markov Decision Process (MDP) of the agent is communicating, we provide finite-sample convergence guarantees, in terms of convergence of both the mean-field and the control policy to the mean-field equilibrium. The sample complexity of the Sandbox Learning algorithm is $\tilde{\mathcal{O}}(\epsilon^{-4})$, where $\epsilon$ is the MFE approximation error; this matches the sample complexity of works that assume access to a mean-field oracle. Finally, we empirically demonstrate the effectiveness of the Sandbox Learning algorithm in diverse scenarios, including those where the MDP does not necessarily have a single communicating class.
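To make the two time-scale structure concrete, the following is a minimal Python sketch of how such a loop might look for a finite MFG along a single sample path; it is not the paper's exact algorithm. The functions \texttt{reward} and \texttt{transition}, the discount factor, the softmax temperature, and the step-size exponents are illustrative assumptions (the paper's communicating-MDP setting may instead use an average-reward update).
\begin{verbatim}
import numpy as np

def sandbox_learning(n_states, n_actions, reward, transition,
                     T=100_000, gamma=0.99, temperature=1.0, seed=0):
    # Illustrative sketch of a two time-scale loop: a fast Q-learning
    # (control policy) update and a slow mean-field fixed-point recursion,
    # both driven by one sample path of the generic agent.
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))   # action-value estimate (fast)
    mu = np.ones(n_states) / n_states     # mean-field estimate (slow)
    s = rng.integers(n_states)            # current state of the generic agent

    for t in range(1, T + 1):
        alpha = 1.0 / t**0.6              # faster step size (policy/Q update)
        beta = 1.0 / t**0.8               # slower step size (mean-field update)

        # Softmax (Boltzmann) policy derived from the current Q estimate.
        logits = Q[s] / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(n_actions, p=probs)

        # One step along the single sample path; reward and transition
        # kernels depend on the current mean-field estimate mu.
        r = reward(s, a, mu)
        s_next = rng.choice(n_states, p=transition(s, a, mu))

        # Fast time-scale: Q-learning update for the generic agent
        # (discounted form chosen here for simplicity).
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

        # Slow time-scale: online fixed-point recursion for the mean-field,
        # nudging mu toward the empirical state occupancy of the sample path.
        e = np.zeros(n_states)
        e[s_next] = 1.0
        mu = (1 - beta) * mu + beta * e

        s = s_next

    return Q, mu
\end{verbatim}
The separation of step sizes ($\beta_t \ll \alpha_t$) is what lets the policy track an (approximately) frozen mean-field while the mean-field itself drifts slowly toward its fixed point.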