Learning a subspace of policies for online adaptation in Reinforcement Learning

Deep Reinforcement Learning (RL) is mainly studied in a setting where the training and testing environments are similar. In many practical applications, however, these environments differ. For instance, in control systems, the robot(s) on which a policy is learned might differ from the robot(s) on which the policy will run. Such differences can be caused by internal factors (e.g., calibration issues, system attrition, defective modules) or by external changes (e.g., weather conditions). There is therefore a need for RL methods that generalize well to variations of the training conditions. In this article, we consider the simplest, yet hard to tackle, generalization setting: the test environment is unknown at train time, which forces the agent to adapt to the system's new dynamics. This online adaptation process can be computationally expensive (e.g., fine-tuning), and it cannot rely on meta-RL techniques since there is only a single training environment. To address this setting, we propose an approach that learns a subspace of policies within the parameter space. This subspace contains an infinite number of policies that are all trained to solve the training environment while having different parameter values. As a consequence, two policies in that subspace process information differently and exhibit different behaviors when facing variations of the training environment. Our experiments, carried out over a large variety of benchmarks, compare our approach with baselines, including diversity-based methods. In comparison, our approach is simple to tune, does not need any extra component (e.g., a discriminator), and learns policies that gather a high reward on unseen environments.
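The idea of a policy subspace can be illustrated with a minimal sketch. The abstract does not say what shape the subspace has or how a policy is chosen at test time, so everything below is an assumption made for illustration: the subspace is taken to be the line segment between two anchor parameter vectors, the policy is a toy linear map, the environment is a toy tracking task, and adaptation simply evaluates a few points of the segment and keeps the best one. All names (`interpolate`, `act`, `episode_return`, `adapt_online`) are hypothetical and not taken from the paper.

```python
import random


def interpolate(theta_a, theta_b, alpha):
    """Point alpha * theta_a + (1 - alpha) * theta_b on the segment (assumed subspace)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(theta_a, theta_b)]


def act(theta, observation):
    """Toy linear policy: the action is the dot product of parameters and observation."""
    return sum(w * o for w, o in zip(theta, observation))


def episode_return(theta, target_gain, n_steps=20, seed=0):
    """Toy stand-in for a rollout (assumption, not a real benchmark).

    The reward at each step is the negative squared gap between the policy's
    action and target_gain * observation[0]; target_gain plays the role of the
    environment dynamics that may change between training and testing.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_steps):
        obs = [rng.uniform(-1.0, 1.0), 1.0]
        desired = target_gain * obs[0]
        total -= (act(theta, obs) - desired) ** 2
    return total


def adapt_online(theta_a, theta_b, evaluate, n_candidates=5):
    """Evaluate evenly spaced policies on the segment and keep the best one.

    This replaces gradient-based fine-tuning with a handful of rollouts;
    it is an illustrative selection rule, assumed here for the sketch.
    """
    alphas = [i / (n_candidates - 1) for i in range(n_candidates)]
    scored = [(evaluate(interpolate(theta_a, theta_b, a)), a) for a in alphas]
    best_return, best_alpha = max(scored)
    return best_alpha, best_return


# Two hypothetical anchors, imagined as both solving the training environment.
theta_a = [1.0, 0.0]
theta_b = [2.0, 0.0]

# Unseen test environment whose dynamics (target_gain) differ from training.
best_alpha, best_return = adapt_online(
    theta_a, theta_b, lambda theta: episode_return(theta, target_gain=1.5)
)
print(best_alpha, best_return)
```

In this toy setting, neither anchor matches the test dynamics, but a policy in the interior of the segment does, which is the kind of behavioral variety the abstract attributes to policies with different parameter values. Selecting among a few candidates costs only a few rollouts, in contrast with fine-tuning.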
