Mean field games facilitate the analysis of multi-armed bandits (MAB) with a large number of agents by approximating their interactions with an average effect. Existing mean field models for multi-agent MAB mostly assume a binary reward function, which leads to tractable analysis but is usually not applicable in practical scenarios. In this paper, we study the mean field bandit game with a continuous reward function. Specifically, we focus on establishing the existence and uniqueness of the mean field equilibrium (MFE), thereby guaranteeing the asymptotic stability of the multi-agent system. To accommodate the continuous reward function, we encode the learned reward into an agent state, which is in turn mapped to the agent's stochastic arm-playing policy and updated using realized observations. We show that the state evolution is upper semi-continuous, from which the existence of an MFE follows. Since Markovian analysis mainly applies to the discrete-state case, we transform the stochastic continuous state evolution into a deterministic ordinary differential equation (ODE). On this basis, we characterize a contraction mapping for the ODE, which ensures a unique MFE for the bandit game. Extensive evaluations validate our MFE characterization and exhibit tight empirical regret for the MAB problem.
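To make the pipeline sketched above concrete (state encoding, a stochastic arm-playing policy, a deterministic ODE approximation, and a contraction-style fixed point), the following is a minimal numerical sketch under purely illustrative assumptions: a two-armed congestion bandit, a logistic policy, linear congestion rewards, and an Euler-discretized ODE. None of these choices are the paper's actual model; they only illustrate how a mean-field fixed point can be computed.

```python
# Minimal illustrative sketch (not the paper's algorithm): fixed-point iteration
# for a mean-field equilibrium in a hypothetical two-armed congestion bandit
# with a continuous reward function.
import numpy as np

def arm_rewards(m):
    """Continuous rewards that depend on the fraction m of agents playing arm 1
    (an assumed congestion model)."""
    r1 = 1.0 - 0.8 * m            # arm 1, played by fraction m
    r2 = 0.7 - 0.3 * (1.0 - m)    # arm 2, played by the remaining agents
    return r1, r2

def policy(s, beta=5.0):
    """Map the scalar agent state s (an encoded reward-gap estimate) to the
    probability of playing arm 1 via a logistic rule."""
    return 1.0 / (1.0 + np.exp(-beta * s))

def ode_step(s, dt=0.05):
    """Deterministic (Euler-discretized ODE) approximation of the stochastic
    state update: the state drifts toward the reward gap induced by the
    current mean field."""
    m = policy(s)                 # population share on arm 1 under state s
    r1, r2 = arm_rewards(m)
    return s + dt * ((r1 - r2) - s)

# Iterate the deterministic dynamics to a fixed point; under a contraction
# assumption this fixed point plays the role of the MFE.
s = 0.0
for _ in range(2000):
    s_next = ode_step(s)
    if abs(s_next - s) < 1e-10:
        break
    s = s_next

m_star = policy(s)
print(f"fixed-point state s* = {s:.4f}, equilibrium arm-1 share m* = {m_star:.4f}")
```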