We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximate multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, standard Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents, whereas the analogous number of computations increases exponentially for the fitted Q iteration (FQI), one of the most commonly used approaches in batch reinforcement learning. This property of AMAFQI is fundamental to the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. Numerical examples illustrate the significant reduction in computation time achieved by AMAFQI over FQI in multi-agent problems and corroborate the similar decision-making performance of both approaches.
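As a rough sketch of the scaling claim stated above (the notation here is assumed for illustration and is not taken from the abstract), consider $m$ agents, each with a finite action set of size $|\mathcal{A}_i| = A$. A greedy policy evaluation with respect to the centralized Q-function, as in FQI, requires a maximization over the joint action space,
\[
\max_{\mathbf{a} \in \mathcal{A}_1 \times \cdots \times \mathcal{A}_m} Q(\mathbf{x}, \mathbf{a}),
\]
i.e., on the order of $A^m$ evaluations per state, which grows exponentially with the number of agents. In contrast, an approach that maintains one approximation per agent and maximizes each over that agent's own action set requires on the order of $mA$ evaluations per state, i.e., a number of computations that grows linearly with the number of agents.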