Reinforcement learning (RL) aims to find an optimal policy through interaction with an environment. Consequently, learning complex behavior requires a vast number of samples, which can be prohibitive in practice. Yet, instead of systematically reasoning about and actively choosing informative samples, policy gradients for local search are often obtained from random perturbations. These random samples yield high-variance estimates and hence are sub-optimal in terms of sample complexity. Actively selecting informative samples is at the core of Bayesian optimization, which constructs a probabilistic surrogate of the objective from past samples to reason about informative subsequent ones. In this paper, we propose to combine both worlds. We develop an algorithm utilizing a probabilistic model of the objective function and its gradient. Based on the model, the algorithm decides where to query a noisy zeroth-order oracle to improve the gradient estimates. The resulting algorithm is a novel type of policy search method, which we compare to existing black-box algorithms. The comparison reveals improved sample complexity and reduced variance in extensive empirical evaluations on synthetic objectives. Further, we highlight the benefits of active sampling on popular RL benchmarks.
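To make the idea sketched above concrete, the following is a minimal NumPy sketch, not the paper's implementation: a Gaussian-process surrogate models the objective, its posterior gradient at the current parameters serves as the policy-gradient estimate, and each new zeroth-order query is actively placed where it most reduces the posterior variance of that gradient. The toy objective `f`, all hyperparameters, and the helper names (`rbf`, `grad_posterior`, `select_query`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
SF2, LS, NOISE = 1.0, 0.5, 1e-2   # kernel variance, lengthscale, observation noise

def rbf(A, B):
    """Squared-exponential kernel between row-stacked points A (n,d) and B (m,d)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return SF2 * np.exp(-0.5 * d2 / LS**2)

def grad_posterior(theta, X, y=None):
    """GP posterior over the objective's gradient at `theta`.

    Returns (mean, cov); mean is None when y is None (the covariance needs no targets)."""
    K = rbf(X, X) + NOISE * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    k = rbf(theta[None, :], X)[0]                          # (n,)
    dk = -k[None, :] * (theta[:, None] - X.T) / LS**2      # (d, n) cross-covariance with the gradient
    cov = (SF2 / LS**2) * np.eye(len(theta)) - dk @ Kinv @ dk.T
    mean = dk @ Kinv @ y if y is not None else None
    return mean, cov

def select_query(theta, X, candidates):
    """Pick the candidate whose inclusion minimizes the gradient's posterior variance."""
    scores = [np.trace(grad_posterior(theta, np.vstack([X, c[None, :]]))[1])
              for c in candidates]
    return candidates[int(np.argmin(scores))]

def f(x):
    """Noisy zeroth-order oracle for a toy quadratic objective (maximum at x = 1)."""
    return -np.sum((x - 1.0) ** 2) + 0.05 * rng.standard_normal()

d, lr = 2, 0.3
theta = np.zeros(d)
X = theta + 0.2 * rng.standard_normal((3, d))              # a few initial queries
y = np.array([f(x) for x in X])

for _ in range(30):
    cand = theta + 0.3 * rng.standard_normal((20, d))      # local candidate queries
    x_new = select_query(theta, X, cand)                   # active sampling step
    X, y = np.vstack([X, x_new[None, :]]), np.append(y, f(x_new))
    grad, _ = grad_posterior(theta, X, y)                  # model-based gradient estimate
    theta = theta + lr * grad                              # gradient-ascent policy update

print("final parameters:", theta)
```

Because the GP posterior covariance depends only on query locations, not on observed values, the query-selection step can be evaluated before paying for an oracle call; this is what lets the sketch reason about which sample would be informative rather than perturbing at random.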