Reinforcement Learning (RL) generally suffers from poor sample complexity, mostly due to the need to exhaustively explore the state space to find good policies. On the other hand, we postulate that expert knowledge of the system to control often allows us to design simple rules we expect good policies to follow at all times. In this work, we hence propose a simple yet effective modification of continuous actor-critic RL frameworks to incorporate such prior knowledge in the learned policies and constrain them to regions of the state space that are deemed interesting, thereby significantly accelerating their convergence. Concretely, we saturate the actions chosen by the agent if they do not comply with our intuition and, critically, modify the gradient update step of the policy to ensure the learning process does not suffer from the saturation step. On a room temperature control simulation case study, these modifications allow agents to converge to well-performing policies up to one order of magnitude faster than classical RL agents while retaining good final performance.
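The abstract only outlines the two ingredients, so the following is a minimal, hypothetical sketch of how they could look in a DDPG-style deterministic actor-critic: the actor's action is saturated to an expert-defined admissible interval before being applied, and a straight-through estimator is used so the clamp does not block the policy gradient. The straight-through trick is one plausible instantiation, not necessarily the authors' exact gradient modification, and all names below (`Actor`, `expert_bounds`, `saturate`, `actor_loss`) as well as the toy bound rule are illustrative assumptions.

```python
# Hypothetical sketch only: the paper's exact algorithm is not given in the
# abstract. This illustrates one reading of "saturate the action, then adjust
# the policy gradient update" in a DDPG-style deterministic actor-critic.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Small deterministic policy producing raw actions in [-1, 1]."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def expert_bounds(state: torch.Tensor):
    """Toy expert rule (assumption): when the first state feature is high,
    only non-positive actions are admissible; otherwise the full range is."""
    low = torch.full_like(state[:, :1], -1.0)
    high = torch.where(state[:, :1] > 0.5,
                       torch.zeros_like(low), torch.ones_like(low))
    return low, high


def saturate(raw_action: torch.Tensor, low: torch.Tensor, high: torch.Tensor):
    """Clip the action into the expert interval; the straight-through trick
    makes the gradient of the clipped action identical to that of the raw one,
    so the saturation step does not zero out the policy gradient."""
    clipped = torch.clamp(raw_action, min=low, max=high)
    return raw_action + (clipped - raw_action).detach()


def actor_loss(actor: Actor, critic: nn.Module, states: torch.Tensor) -> torch.Tensor:
    """Deterministic policy-gradient actor loss, evaluated on the saturated
    (i.e. actually applied) actions."""
    raw = actor(states)
    low, high = expert_bounds(states)
    applied = saturate(raw, low, high)
    return -critic(states, applied).mean()
```

The design intent of the sketch is that the environment only ever receives actions complying with the expert rules, while the critic's gradient still reaches the actor unattenuated even when the raw action is clipped.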