We study infinite-horizon discounted Mean Field Control (MFC) problems with common noise through the lens of Mean Field Markov Decision Processes (MFMDP). We allow the agents to use actions that are randomized not only at the individual level but also at the level of the population. This common randomization allows us to establish connections between both closed-loop and open-loop policies for MFC and Markov policies for the MFMDP. In particular, we show that there exists an optimal closed-loop policy for the original MFC problem. Building on this framework and the notion of state-action value function, we then propose reinforcement learning (RL) methods for such problems, by adapting existing tabular and deep RL methods to the mean-field setting. The main difficulty is the treatment of the population state, which is an input to both the policy and the value function. We provide convergence guarantees for tabular algorithms based on discretizations of the simplex. Neural network-based algorithms are more suitable for continuous spaces and allow us to avoid discretizing the mean-field state space. Numerical examples are provided.
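To make the tabular approach mentioned above concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of Q-learning on a toy mean-field MDP whose state is the population distribution over two individual states, discretized on the 1-simplex. The dynamics `step`, the reward, the noise level, and all constants are hypothetical choices made only to keep the example self-contained and runnable.

```python
# Illustrative sketch: tabular Q-learning on a discretized simplex for a toy
# mean-field MDP with common noise. All model ingredients are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

N_BINS = 21            # discretization of the simplex: mu_1 in {0, 1/20, ..., 1}
ACTIONS = [0.0, 1.0]   # population-level action: target fraction in state 1
GAMMA = 0.95           # discount factor
ALPHA = 0.1            # learning rate
EPS = 0.1              # epsilon-greedy exploration

def to_bin(mu1):
    """Map the mass in state 1 to the nearest grid point on the simplex."""
    return int(round(mu1 * (N_BINS - 1)))

def step(mu1, a):
    """Toy mean-field dynamics: the population drifts towards the action's
    target fraction, perturbed by a shared (common-noise) shock."""
    common_noise = 0.05 * rng.standard_normal()
    new_mu1 = np.clip(mu1 + 0.3 * (a - mu1) + common_noise, 0.0, 1.0)
    reward = -(new_mu1 - 0.5) ** 2 - 0.01 * a   # keep the population balanced
    return new_mu1, reward

Q = np.zeros((N_BINS, len(ACTIONS)))

mu1 = rng.uniform()
for t in range(50_000):
    s = to_bin(mu1)
    if rng.uniform() < EPS:
        a_idx = int(rng.integers(len(ACTIONS)))
    else:
        a_idx = int(np.argmax(Q[s]))
    next_mu1, r = step(mu1, ACTIONS[a_idx])
    s_next = to_bin(next_mu1)
    # standard tabular Q-learning update on the discretized mean-field state
    Q[s, a_idx] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a_idx])
    mu1 = next_mu1

greedy = np.argmax(Q, axis=1)
print("greedy action per simplex bin:", greedy)
```

The deep RL variants discussed in the abstract would instead feed the (continuous) distribution directly into a neural network approximator of the state-action value function, avoiding the simplex discretization used in this sketch.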