Centralized Training for Decentralized Execution, in which training is done in a centralized offline fashion, has become a popular solution paradigm in Multi-Agent Reinforcement Learning. Many such methods take the form of actor-critic with state-based critics, since centralized training allows access to the true system state, which can be useful during training despite not being available at execution time. State-based critics have become a common empirical choice, albeit one with limited theoretical justification or analysis. In this paper, we show that state-based critics can introduce bias in the policy gradient estimates, potentially undermining the asymptotic guarantees of the algorithm. We also show that, even when state-based critics do not introduce any bias, they can still result in larger gradient variance, contrary to common intuition. Finally, we demonstrate the practical effects of these theoretical results by comparing different forms of centralized critics on a wide range of common benchmarks, and detail how various environmental properties relate to the effectiveness of different types of critics.
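To make the bias claim concrete, the display below is a minimal sketch (not the paper's exact derivation) of the centralized actor-critic policy gradient in a partially observable multi-agent setting, written with illustrative notation: joint history \(h\), per-agent histories \(h_i\), joint action \(a\), and true state \(s\).

\[
\nabla_\theta J(\theta)
= \mathbb{E}\!\left[\sum_{i} \nabla_\theta \log \pi_{\theta_i}(a_i \mid h_i)\, Q^{\pi}(h, a)\right]
\quad\text{vs.}\quad
\mathbb{E}\!\left[\sum_{i} \nabla_\theta \log \pi_{\theta_i}(a_i \mid h_i)\, Q^{\pi}(s, a)\right],
\]
where the state-based form on the right is an unbiased substitute for the history-based form on the left only if \(\mathbb{E}_{s \sim p(\cdot \mid h)}\!\left[Q^{\pi}(s, a)\right] = Q^{\pi}(h, a)\) for the histories visited under \(\pi\); when this equality fails, the state-based critic biases the gradient estimate. Intuitively, even when the equality holds, \(Q^{\pi}(s, a)\) still fluctuates with the hidden state \(s\) given \(h\), which is one way the larger gradient variance mentioned above can arise.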