This paper presents a hierarchical reinforcement learning algorithm constrained by differentiable signal temporal logic. Previous work on logic-constrained reinforcement learning encodes these constraints into a reward function and constrains policy updates with a sample-based policy gradient. However, such techniques tend to be sample-inefficient because of the large number of samples required to obtain accurate policy gradients. In this paper, instead of implicitly constraining policy search with sample-based policy gradients, we directly constrain policy search by backpropagating through the formal constraints, enabling hierarchical policies to be trained with substantially fewer samples. The use of hierarchical policies is recognized as a crucial component of reinforcement learning with task constraints. We show that we can stably constrain policy updates, enabling different levels of the policy to be learned simultaneously and yielding superior performance compared with training them separately. Experimental results on several simulated high-dimensional robot dynamics and a real-world differential-drive robot (TurtleBot3) demonstrate the effectiveness of our approach on five different types of task constraints. Demo videos, code, and models can be found at our project website: https://sites.google.com/view/dscrl
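To make the idea of backpropagating through a differentiable signal temporal logic constraint concrete, the following is a minimal sketch, not the authors' implementation. It assumes a PyTorch-style differentiable rollout and uses a common soft (log-sum-exp) approximation of STL robustness for an "eventually reach the goal" specification; the names `soft_max`, `eventually_reach_robustness`, `rollout`, `rl_loss`, and `lambda_c` are illustrative assumptions.

```python
import torch

def soft_max(x, temp=10.0):
    # Smooth approximation of max over the time axis (log-sum-exp),
    # so gradients flow through the STL "eventually" operator.
    return torch.logsumexp(temp * x, dim=-1) / temp

def eventually_reach_robustness(traj, goal, eps=0.1, temp=10.0):
    # traj: (T, state_dim) differentiable rollout; goal: (state_dim,).
    # Predicate robustness at each step: eps - ||s_t - goal||.
    rho_t = eps - torch.norm(traj - goal, dim=-1)
    # "Eventually" = max over time; the soft version keeps it differentiable.
    return soft_max(rho_t, temp)

# Hypothetical usage: penalize constraint violation alongside the RL objective,
# so the constraint gradient is obtained by backpropagation rather than sampling.
# traj = rollout(policy, env_model)   # differentiable rollout (assumed)
# loss = rl_loss - lambda_c * torch.clamp(eventually_reach_robustness(traj, goal), max=0.0)
# loss.backward()                     # gradients flow through the STL constraint
```

The key design point this sketch illustrates is that the constraint enters the loss as a smooth robustness value, so a single backward pass yields its gradient with respect to the policy parameters, in contrast to estimating that gradient from many sampled trajectories.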