Variational Policy Search using Sparse Gaussian Process Priors for Learning Multimodal Optimal Actions

Policy search reinforcement learning has drawn much attention as a method for learning robot control policies. In particular, policy search with non-parametric policies such as Gaussian process regression can learn optimal actions from high-dimensional and redundant sensor inputs. However, previous methods implicitly assume that the optimal action is unique for each state. This assumption can severely limit practical applications such as robot manipulation, since it is difficult to design a reward function that yields only one optimal action for a complex task. Under such rewards, previous methods may suffer critical performance deterioration, because typical non-parametric policies are unimodal and therefore cannot capture multiple optimal actions. We propose novel approaches to non-parametric policy search with multiple optimal actions and offer two algorithms, both based on a sparse Gaussian process prior and variational Bayesian inference. The key ideas are 1) multimodality, for capturing multiple optimal actions, and 2) mode-seeking, for capturing one optimal action by ignoring the others. First, we propose a multimodal sparse Gaussian process policy search that uses multiple overlapped GPs as a prior. Second, we propose a mode-seeking sparse Gaussian process policy search that uses the Student-t distribution as the likelihood function. The effectiveness of these algorithms is demonstrated in simulation on object manipulation tasks with multiple optimal actions.
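The contrast between a unimodal fit and a mode-seeking fit can be illustrated with a small one-dimensional toy. The sketch below is not the sparse Gaussian process or variational method described in the abstract: it drops the state input entirely and assumes a single state with two equally rewarded actions near -1 and +1. The function names (`gaussian_fit`, `student_t_location`), the fixed `scale`, the degrees of freedom `nu`, and the starting point `init` are all hypothetical choices made for illustration. A Gaussian fit averages the two modes and lands on an action that is not optimal, whereas a Student-t likelihood, fitted here with a standard EM reweighting scheme, down-weights the far mode and settles near one of the optimal actions.

```python
import numpy as np


def gaussian_fit(actions, weights):
    """Weighted maximum-likelihood mean of a single Gaussian (unimodal fit)."""
    return float(np.sum(weights * actions) / np.sum(weights))


def student_t_location(actions, weights, init, nu=1.0, scale=0.2, iters=100):
    """Location of a Student-t likelihood with fixed scale, fitted by EM.

    Each iteration down-weights samples far from the current estimate,
    so the estimate is drawn to the mode nearest to `init`.
    """
    mu = init
    for _ in range(iters):
        # EM responsibilities for the Student-t: small for distant samples.
        u = (nu + 1.0) / (nu + ((actions - mu) / scale) ** 2)
        mu = float(np.sum(weights * u * actions) / np.sum(weights * u))
    return mu


# Toy data (assumption): two equally good actions, near -1 and near +1.
rng = np.random.default_rng(0)
actions = np.concatenate([
    rng.normal(-1.0, 0.05, 200),
    rng.normal(1.0, 0.05, 200),
])
weights = np.ones_like(actions)  # equal reward weights for both modes

mean_action = gaussian_fit(actions, weights)                   # falls between the modes
mode_action = student_t_location(actions, weights, init=0.3)   # moves toward +1

print(f"Gaussian fit:  {mean_action:.3f}")
print(f"Student-t fit: {mode_action:.3f}")
```

In this toy, which mode the Student-t fit selects depends on the starting point; starting at a negative `init` would pull the estimate toward the action near -1 instead.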
