Episode-based reinforcement learning (ERL) algorithms treat reinforcement learning (RL) as a black-box optimization problem in which we learn to select the parameter vector of a controller, often represented as a movement primitive, for a given task descriptor called a context. ERL offers several distinct benefits over step-based RL. It generates smooth control trajectories, can handle non-Markovian reward definitions, and the resulting exploration in parameter space is well suited to sparse-reward settings. Yet, the high dimensionality of the movement primitive parameters has so far hampered the effective use of deep RL methods. In this paper, we present a new algorithm for deep ERL. It is based on differentiable trust region layers, a successful on-policy deep RL approach. These layers allow us to specify trust regions for the policy update that are solved exactly for each state using convex optimization, which enables learning policies with the high precision required for ERL. We compare our ERL algorithm to state-of-the-art step-based algorithms on many complex simulated robotic control tasks. In doing so, we investigate different reward formulations: dense, sparse, and non-Markovian. While step-based algorithms perform well only on dense rewards, ERL performs favorably on sparse and non-Markovian rewards. Moreover, our results show that sparse and non-Markovian rewards are also often better suited to define the desired behavior, allowing us to obtain considerably higher-quality policies than step-based RL.
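As a hedged sketch of the episode-based setting described above (the notation is illustrative and not taken verbatim from the paper), the policy $\pi(\boldsymbol{\theta} \mid \boldsymbol{c})$ selects movement-primitive parameters $\boldsymbol{\theta}$ for a context $\boldsymbol{c}$, and the update is constrained to a trust region around the previous policy:
\begin{align}
\max_{\pi}\; & \mathbb{E}_{\boldsymbol{c} \sim p(\boldsymbol{c}),\, \boldsymbol{\theta} \sim \pi(\cdot \mid \boldsymbol{c})}\big[ R(\boldsymbol{\theta}, \boldsymbol{c}) \big] \\
\text{s.t.}\; & D_{\mathrm{KL}}\!\big( \pi(\cdot \mid \boldsymbol{c}) \,\big\|\, \pi_{\mathrm{old}}(\cdot \mid \boldsymbol{c}) \big) \le \epsilon \quad \text{for each context } \boldsymbol{c},
\end{align}
where $R(\boldsymbol{\theta}, \boldsymbol{c})$ denotes the episodic return of executing the primitive with parameters $\boldsymbol{\theta}$ in context $\boldsymbol{c}$, and $\epsilon$ is the trust region bound.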