We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximated multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, learned Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents, whereas the analogous number of computations increases exponentially for the fitted Q iteration (FQI), a commonly used approach in batch reinforcement learning. This property of AMAFQI is fundamental for the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. The simulations illustrate the significant reduction in computation time achieved by AMAFQI over FQI in multi-agent problems and corroborate the similar performance of both approaches.
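As a rough illustration of the scaling claim above (using assumed notation not defined in the abstract: $m$ agents, each with an individual action set $\mathcal{A}_i$ of size $A$), the greedy policy in FQI requires maximizing the learned Q-function over the joint action space, whereas AMAFQI's policy search operates agent by agent:
\[
\underbrace{\left|\mathcal{A}_1 \times \cdots \times \mathcal{A}_m\right|}_{\text{FQI: joint maximization}} = A^m
\qquad \text{vs.} \qquad
\underbrace{\sum_{i=1}^{m} \left|\mathcal{A}_i\right|}_{\text{AMAFQI: per-agent search}} = mA .
\]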