In reinforcement learning (RL), when defining a Markov Decision Process (MDP), the environment dynamics are implicitly assumed to be stationary. This assumption of stationarity, while simplifying, can be unrealistic in many scenarios. In continual reinforcement learning, the sequence of tasks is another source of nonstationarity. In this work, we propose to examine this continual reinforcement learning setting through the block contextual MDP (BC-MDP) framework, which enables us to relax the assumption of stationarity. This framework challenges RL algorithms to handle both nonstationarity and rich observation settings and, by additionally leveraging smoothness properties, enables us to study generalization bounds for this setting. Finally, we take inspiration from adaptive control to propose a novel algorithm that addresses the challenges introduced by this more realistic BC-MDP setting, allows for zero-shot adaptation at evaluation time, and achieves strong performance on several nonstationary environments.