In many real-world applications, safety constraints for reinforcement learning (RL) algorithms are either unknown or not explicitly defined. We propose a framework that concurrently learns safety constraints and optimal RL policies in such environments, supported by theoretical guarantees. Our approach combines a logically-constrained RL algorithm with an evolutionary algorithm that synthesizes signal temporal logic (STL) specifications. The framework is underpinned by theorems that establish the convergence of the joint learning process and provide error bounds between the discovered policy and the true optimal policy. We demonstrate our framework in grid-world environments, where it successfully identifies both acceptable safety constraints and RL policies, validating our theoretical results in practice.
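To make the joint learning idea concrete, below is a minimal sketch, not the paper's implementation: an evolutionary outer loop mutates a candidate set of forbidden states (a crude stand-in for synthesizing an STL specification), while an inner tabular Q-learning loop trains a policy penalized for violating the current candidate constraint. All names, parameters, and the grid-world setup are illustrative assumptions.

```python
# Hypothetical sketch of concurrent constraint synthesis and constrained RL.
import random

GRID = 5                      # toy 5x5 grid-world
GOAL = (4, 4)
HAZARDS = {(2, 2), (3, 1)}    # "true" unsafe states, unknown to the learner
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(state, action):
    x, y = state
    dx, dy = action
    return (min(max(x + dx, 0), GRID - 1), min(max(y + dy, 0), GRID - 1))

def train_policy(forbidden, episodes=300, alpha=0.5, gamma=0.95, eps=0.2):
    """Inner loop: Q-learning with a penalty for entering states the candidate
    constraint marks as unsafe (standing in for an STL robustness penalty)."""
    Q = {}
    for _ in range(episodes):
        s = (0, 0)
        for _ in range(50):
            if random.random() < eps:
                a = random.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: Q.get((s, i), 0.0))
            s2 = step(s, ACTIONS[a])
            r = 1.0 if s2 == GOAL else -0.01
            if s2 in forbidden:
                r -= 1.0          # penalty for violating the candidate constraint
            best_next = max(Q.get((s2, i), 0.0) for i in range(len(ACTIONS)))
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
            s = s2
            if s == GOAL:
                break
    return Q

def evaluate(Q):
    """Fitness of a candidate constraint: average task success of the trained
    policy minus the rate of observed unsafe events in the environment."""
    successes, violations, rollouts = 0.0, 0.0, 20
    for _ in range(rollouts):
        s = (0, 0)
        for _ in range(50):
            a = max(range(len(ACTIONS)), key=lambda i: Q.get((s, i), 0.0))
            s = step(s, ACTIONS[a])
            if s in HAZARDS:
                violations += 1.0   # unsafe outcome observed during execution
            if s == GOAL:
                successes += 1.0
                break
    return (successes - violations) / rollouts

def mutate(candidate):
    """Outer loop operator: toggle one state in the candidate forbidden set
    (a crude stand-in for mutating an STL specification)."""
    c = set(candidate)
    cell = (random.randrange(GRID), random.randrange(GRID))
    c.discard(cell) if cell in c else c.add(cell)
    return c

# Joint learning loop: evolve candidate constraints, train a policy per candidate.
population = [set() for _ in range(4)]
for gen in range(5):
    scored = sorted(((evaluate(train_policy(c)), c) for c in population),
                    key=lambda t: t[0], reverse=True)
    best_fit, best_cand = scored[0]
    population = [best_cand] + [mutate(best_cand) for _ in range(3)]
    print(f"generation {gen}: fitness={best_fit:.2f}, constraint={sorted(best_cand)}")
```

In this toy loop, the fittest candidate constraint is the one whose induced policy reaches the goal while rarely triggering unsafe outcomes, which mirrors the framework's joint objective of recovering acceptable safety constraints together with a near-optimal policy.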