The predominant approach in reinforcement learning is to assign credit to actions based on the expected return. However, we show that the return may depend on the policy in a way that could lead to excessive variance in value estimation and slow down learning. Instead, we show that the advantage function can be interpreted as causal effects, which share similar properties with causal representations. Based on this insight, we propose Direct Advantage Estimation (DAE), a novel method that can model the advantage function and estimate it directly from on-policy data while simultaneously minimizing the variance of the return, without requiring the (action-)value function. We also relate our method to Temporal Difference methods by showing how value functions can be seamlessly integrated into DAE. The proposed method is easy to implement and can be readily adopted by modern actor-critic methods. We evaluate DAE empirically on three discrete control domains and show that it can outperform generalized advantage estimation (GAE), a strong baseline for advantage estimation, on a majority of the environments when applied to policy optimization.
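For context on the baseline the abstract compares against: the sketch below implements the standard GAE recursion, A_t = δ_t + γλ A_{t+1} with δ_t = r_t + γ V(s_{t+1}) − V(s_t) (Schulman et al., 2016). This is background on GAE only, not the DAE objective proposed in the paper; the function name `gae_advantages` and the NumPy array interface are illustrative assumptions.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (GAE) over one trajectory.

    rewards: shape (T,)   -- rewards r_0, ..., r_{T-1}
    values:  shape (T+1,) -- value estimates V(s_0), ..., V(s_T);
                             the last entry bootstraps the truncated tail
    Returns advantage estimates A_0, ..., A_{T-1}.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

The lambda parameter trades off bias and variance: λ = 0 reduces to the one-step TD error, while λ = 1 recovers the Monte Carlo return minus the value baseline.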