Deep reinforcement learning algorithms can perform poorly in real-world tasks due to the discrepancy between source and target environments. This discrepancy is commonly viewed as a disturbance in transition dynamics. Many existing algorithms learn robust policies by modeling the disturbance and applying it to source environments during training, which usually requires prior knowledge about the disturbance and control over the simulator. However, these algorithms can fail in scenarios where the disturbance from the target environment is unknown or intractable to model in simulators. To tackle this problem, we propose state-conservative policy optimization (SCPO), a novel model-free actor-critic algorithm that learns robust policies without modeling the disturbance in advance. Specifically, SCPO reduces the disturbance in transition dynamics to a disturbance in state space and then approximates it with a simple gradient-based regularizer. SCPO is appealing because it is simple to implement and requires neither additional knowledge about the disturbance nor specially designed simulators. Experiments on several robot control tasks demonstrate that SCPO learns policies that are robust to disturbances in transition dynamics.
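To make the idea of a gradient-based state regularizer concrete, the sketch below shows one possible implementation in PyTorch: the actor objective is re-evaluated at states perturbed along the gradient direction within a small radius, as a first-order surrogate for the worst-case disturbance in state space. This is a minimal illustrative sketch, not the paper's exact formulation; the `Actor` and `Critic` networks, the normalization of the gradient, and the radius `eps` are all assumptions introduced here for exposition.

```python
# Hypothetical sketch of a gradient-based state regularizer (PyTorch assumed).
import torch
import torch.nn as nn


class Actor(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class Critic(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def state_conservative_actor_loss(actor, critic, states, eps=0.05):
    """Actor loss evaluated at gradient-based worst-case state perturbations.

    The perturbation direction is approximated by the gradient of the actor
    loss with respect to the input states, scaled to the radius `eps`.
    This is a first-order approximation of the worst disturbance in a small
    ball around each state.
    """
    states = states.clone().detach().requires_grad_(True)
    # Standard actor objective: maximize Q(s, pi(s)), i.e. minimize its negative.
    loss = -critic(states, actor(states)).mean()
    grad = torch.autograd.grad(loss, states)[0]          # ascent direction of the loss
    delta = eps * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    perturbed = (states + delta).detach()                # no gradient through the perturbation
    # Re-evaluate the objective at the perturbed states.
    return -critic(perturbed, actor(perturbed)).mean()


# Illustrative usage on random data.
actor, critic = Actor(state_dim=3, action_dim=1), Critic(state_dim=3, action_dim=1)
states = torch.randn(32, 3)
loss = state_conservative_actor_loss(actor, critic, states)
loss.backward()
```

A usage note on the design: because the perturbation is detached, the regularizer only changes where the objective is evaluated, so it adds one extra backward pass per update and needs no modification to the simulator or environment.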