Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms exploit the learned value function only locally through first-order updates, which limits their sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of PG methods, based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently performing the argmax operation in continuous action spaces: it finds the max-valued action within a small number of samples. The policy learning of ZOSPI has two steps: first, it samples actions and evaluates them with a learned value estimator; then, it learns to perform the action with the highest value through supervised learning. We further demonstrate that such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on continuous control benchmarks with remarkable sample efficiency.
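To make the two-step policy-improvement procedure concrete, the following is a minimal sketch of one ZOSPI-style update in PyTorch. It assumes a learned critic `q_net(state, action)`, a deterministic policy `pi_net(state)`, and a uniform candidate-sampling scheme over a bounded action space; these names, the sampling distribution, and the mean-squared-error regression loss are illustrative assumptions rather than the paper's exact implementation.

```python
import torch


def zospi_policy_update(pi_net, q_net, pi_optimizer, states,
                        n_samples=16, action_low=-1.0, action_high=1.0):
    """One supervised policy-improvement step on a batch of states (sketch)."""
    batch_size = states.shape[0]
    action_dim = pi_net(states).shape[-1]

    with torch.no_grad():
        # Step 1: sample candidate actions per state (zeroth-order search),
        # and include the current policy action to preserve local exploitation.
        uniform = torch.rand(batch_size, n_samples, action_dim)
        candidates = action_low + (action_high - action_low) * uniform
        candidates = torch.cat([candidates, pi_net(states).unsqueeze(1)], dim=1)

        # Evaluate every candidate with the learned value estimator Q(s, a).
        expanded_states = states.unsqueeze(1).expand(-1, candidates.shape[1], -1)
        q_values = q_net(expanded_states.reshape(-1, states.shape[-1]),
                         candidates.reshape(-1, action_dim))
        q_values = q_values.reshape(batch_size, candidates.shape[1])

        # Pick the highest-valued sampled action as the regression target.
        best_idx = q_values.argmax(dim=1)
        target_actions = candidates[torch.arange(batch_size), best_idx]

    # Step 2: supervised learning -- regress the policy onto the best actions.
    loss = torch.nn.functional.mse_loss(pi_net(states), target_actions)
    pi_optimizer.zero_grad()
    loss.backward()
    pi_optimizer.step()
    return loss.item()
```

In this sketch the global exploitation of $Q$ comes from the sampled candidates covering the whole action range, while appending the current policy action keeps the local, PG-like refinement described in the abstract.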