Deep Reinforcement Learning (DRL) has the potential to synthesize feedback controllers (agents) for various complex systems with unknown dynamics. These systems are expected to satisfy diverse safety and liveness properties that are best captured using temporal logic. In RL, the reward function plays a crucial role in specifying the desired behaviour of these agents. However, the problem of designing the reward function for an RL agent to satisfy complex temporal logic specifications has received limited attention in the literature. To address this, we provide a systematic way of generating rewards in real-time using the quantitative semantics of Signal Temporal Logic (STL), a temporal logic widely used to specify the behaviour of cyber-physical systems. We propose a new quantitative semantics for STL with several desirable properties that make it suitable for reward generation. We evaluate our STL-based reinforcement learning mechanism on several complex continuous control benchmarks and compare our STL semantics with those available in the literature in terms of their efficacy in synthesizing the controller agent. Experimental results establish that our new semantics is the most suitable for synthesizing feedback controllers for complex continuous dynamical systems through reinforcement learning.
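To make the reward-generation recipe concrete, the sketch below computes the classical (max/min) quantitative semantics, i.e. robustness, of simple STL operators over a state trace and uses the resulting value as an RL reward. This is a minimal illustration assuming the standard textbook semantics; the paper's proposed semantics differs, and all function names and the example formula are hypothetical.

```python
import numpy as np

# Minimal sketch of classical STL robustness as an RL reward signal.
# Assumes discrete time steps; rho >= 0 means the formula is satisfied.

def robustness_pred(trace, f):
    """Robustness of an atomic predicate f(state) >= 0 at each time step."""
    return np.array([f(s) for s in trace])

def robustness_always(rho, a, b):
    """rho(G_[a,b] phi)(t) = min over tau in [t+a, t+b] of rho(phi)(tau)."""
    n = len(rho)
    out = np.full(n, -np.inf)
    for t in range(n):
        lo, hi = t + a, min(t + b, n - 1)
        if lo <= hi:
            out[t] = rho[lo:hi + 1].min()
    return out

def robustness_eventually(rho, a, b):
    """rho(F_[a,b] phi)(t) = max over tau in [t+a, t+b] of rho(phi)(tau)."""
    n = len(rho)
    out = np.full(n, -np.inf)
    for t in range(n):
        lo, hi = t + a, min(t + b, n - 1)
        if lo <= hi:
            out[t] = rho[lo:hi + 1].max()
    return out

# Example: reward the agent for keeping |x| below 1.0 over a 10-step horizon
# (the specification G_[0,9] (1 - |x| >= 0), chosen here for illustration).
trace = [np.array([0.3]), np.array([0.5]), np.array([0.9])]   # partial trajectory
rho_pred = robustness_pred(trace, lambda s: 1.0 - abs(s[0]))  # atomic robustness
reward = robustness_always(rho_pred, 0, 9)[0]                 # robustness at t=0 as reward
```

A reward derived this way is real-valued rather than binary, so the agent receives a gradient-like signal of how close it is to satisfying the specification rather than a sparse pass/fail outcome.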