Policy-gradient methods are widely used for learning control policies. They can be easily distributed to multiple workers and reach state-of-the-art results in many domains. Unfortunately, they exhibit large variance and consequently suffer from high sample complexity, since they aggregate gradients over entire trajectories. At the other extreme, planning methods, such as tree search, optimize the policy using single-step transitions that incorporate future lookahead. These approaches have mainly been considered for value-based algorithms. Planning-based algorithms require a forward model and are computationally intensive at each step, but are more sample efficient. In this work, we introduce SoftTreeMax, the first approach that integrates tree search into policy gradient. Traditionally, gradients are computed for single state-action pairs. Instead, our tree-based policy structure leverages all gradients at the tree leaves in each environment step. This allows us to reduce the variance of gradients by three orders of magnitude and to benefit from better sample complexity compared with standard policy gradient. On Atari, SoftTreeMax demonstrates up to 5x better performance with faster run-time compared with distributed PPO.
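To make the tree-based policy structure concrete, the following is a minimal, illustrative sketch of a SoftTreeMax-style policy, not the authors' implementation. It assumes a deterministic forward model `step(s, a)` returning `(next_state, reward)`, a small discrete action space, a stand-in leaf logit function `theta_logit(s)`, and a log-sum-exp aggregation over paths sharing the same root action; the names `GAMMA`, `BETA`, `depth`, and the toy environment are all hypothetical choices for this example.

```python
# Illustrative sketch only (assumed interfaces, not the paper's code).
import numpy as np
from itertools import product

GAMMA = 0.99   # discount factor (assumed)
BETA = 1.0     # inverse temperature of the softmax (assumed)


def softtreemax_probs(state, actions, step, theta_logit, depth):
    """Return action probabilities by scoring every depth-`depth` path in the
    lookahead tree rooted at `state` and aggregating per root action."""
    scores = []
    for a0 in actions:
        # Score of each path: discounted rewards along the path plus a learned
        # logit at the leaf state.
        path_scores = []
        for tail in product(actions, repeat=depth - 1):
            s, ret, disc = state, 0.0, 1.0
            for a in (a0,) + tail:
                s, r = step(s, a)
                ret += disc * r
                disc *= GAMMA
            path_scores.append(BETA * (ret + disc * theta_logit(s)))
        # Log-sum-exp over all paths that start with the same root action
        # (one possible aggregation; an expectation over paths is another).
        m = max(path_scores)
        scores.append(m + np.log(np.sum(np.exp(np.array(path_scores) - m))))
    scores = np.array(scores)
    e = np.exp(scores - scores.max())
    return e / e.sum()


if __name__ == "__main__":
    # Toy chain environment: moving right (action 1) yields reward 1, left yields 0.
    step = lambda s, a: (s + (1 if a == 1 else -1), float(a == 1))
    probs = softtreemax_probs(state=0, actions=[0, 1], step=step,
                              theta_logit=lambda s: 0.0, depth=3)
    print(probs)  # the right-moving action should receive most of the probability
```

Because every leaf of the expanded tree contributes to the policy's logits, the gradient at each environment step depends on all leaves rather than on a single sampled state-action pair, which is the mechanism behind the variance reduction described above.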