Zeroth-order optimization methods and policy-gradient-based first-order methods are two promising alternatives for solving reinforcement learning (RL) problems, with complementary advantages. The former work with arbitrary policies, drive state-dependent and temporally-extended exploration, and possess a robustness-seeking property, but suffer from high sample complexity; the latter are more sample-efficient but are restricted to differentiable policies, and the learned policies tend to be less robust. We propose the Zeroth-Order Actor-Critic algorithm (ZOAC), which unifies these two methods into an on-policy actor-critic architecture to preserve the advantages of both. In each iteration, ZOAC alternates between collecting rollouts with timestep-wise perturbations in parameter space, first-order policy evaluation (PEV), and zeroth-order policy improvement (PIM). We evaluate the proposed method on a range of challenging continuous control benchmarks using different types of policies, where ZOAC outperforms both zeroth-order and first-order baseline algorithms.
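A minimal sketch of the iteration structure described above, not the authors' reference implementation: a toy environment, linear policy and value function, and all hyperparameter names (sigma, alpha_pi, alpha_v, gamma) are illustrative assumptions.

```python
# Sketch of the ZOAC loop: timestep-wise parameter-space perturbation,
# first-order policy evaluation (PEV), zeroth-order policy improvement (PIM).
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 3, 1
theta = np.zeros((act_dim, obs_dim))       # deterministic linear policy u = theta @ s
w = np.zeros(obs_dim + 1)                  # linear state-value function V(s) = w @ [s, 1]
sigma, alpha_pi, alpha_v, gamma = 0.1, 0.05, 0.01, 0.99  # assumed hyperparameters

def env_step(s, u):
    """Toy point-mass dynamics with quadratic cost (stands in for a real benchmark)."""
    s_next = 0.9 * s + 0.1 * np.concatenate([u, np.zeros(obs_dim - act_dim)])
    r = -float(s @ s + 0.1 * u @ u)
    return s_next, r

def value(s):
    return w @ np.append(s, 1.0)

for iteration in range(200):
    # --- Rollout collection with timestep-wise perturbation in parameter space ---
    s = rng.normal(size=obs_dim)
    transitions = []                       # (state, perturbation, reward, next_state)
    for t in range(50):
        eps = rng.normal(size=theta.shape) # fresh perturbation at every timestep
        u = (theta + sigma * eps) @ s
        s_next, r = env_step(s, u)
        transitions.append((s, eps, r, s_next))
        s = s_next

    # --- First-order PEV: gradient-based TD(0) updates of the value function ---
    for st, _, r, st_next in transitions:
        td_error = r + gamma * value(st_next) - value(st)
        w += alpha_v * td_error * np.append(st, 1.0)

    # --- Zeroth-order PIM: advantage-weighted, ES-style update of the policy ---
    grad = np.zeros_like(theta)
    for st, eps, r, st_next in transitions:
        advantage = r + gamma * value(st_next) - value(st)
        grad += advantage * eps / sigma
    theta += alpha_pi * grad / len(transitions)
```

The policy update never differentiates through the policy itself; it only weights the sampled parameter perturbations by critic-based advantage estimates, which is what allows the scheme to handle non-differentiable policies while the critic is still trained with first-order methods.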