We present a reinforcement learning algorithm for learning sparse non-parametric controllers in a Reproducing Kernel Hilbert Space. We improve the sample complexity of this approach by imposing structure on the state-action value function through a normalized advantage function (NAF). This policy representation enables multiple learned models to be composed efficiently without additional training samples or interaction with the environment. We demonstrate the performance of this algorithm by learning obstacle-avoidance policies in multiple simulations of a robot equipped with a laser scanner navigating a 2D environment. We apply the composition operation to various policy combinations and test them to show that the composed policies retain the performance of their components. We also transfer a composed policy directly to a physical platform operating in an arena with obstacles in order to demonstrate a degree of generalization.
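For context, the following is a minimal sketch of the quadratic structure a NAF imposes on the state-action value function; the notation here ($V$, $\mu$, $P$) follows the standard NAF formulation and is our assumption, since the paper's exact RKHS parameterization of each term may differ:

\[
Q(s, a) = V(s) + A(s, a),
\qquad
A(s, a) = -\tfrac{1}{2}\,\bigl(a - \mu(s)\bigr)^{\top} P(s)\,\bigl(a - \mu(s)\bigr),
\]

where $P(s)$ is a positive-definite matrix, so $A(s, a) \le 0$ and the greedy action is available in closed form as $a^{*} = \mu(s)$, avoiding an inner maximization over continuous actions.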