Non-stationary environments are challenging for reinforcement learning algorithms. If the state transition and/or reward functions change based on latent factors, the agent is effectively tasked with optimizing a behavior that maximizes performance over a possibly infinite random sequence of Markov Decision Processes (MDPs), each drawn from some unknown distribution. We call each such MDP a context. Most related works make strong assumptions, such as knowledge of the distribution over contexts, the existence of pre-training phases, or a priori knowledge of the number of contexts, their sequence, or the boundaries between them. We introduce an algorithm that efficiently learns policies in non-stationary environments. It analyzes a possibly infinite stream of data and computes, in real time, high-confidence change-point detection statistics that reflect whether novel, specialized policies need to be created and deployed to tackle novel contexts, or whether previously-optimized ones can be reused. We show that (i) this algorithm minimizes the delay until unforeseen changes of context are detected, thereby allowing for rapid responses; and (ii) it bounds the false-alarm rate, which is important for minimizing regret. Our method constructs a mixture model composed of a (possibly infinite) ensemble of probabilistic dynamics predictors that model the different modes of the distribution over underlying latent MDPs. We evaluate our algorithm on high-dimensional continuous reinforcement learning problems and show that it outperforms state-of-the-art (model-free and model-based) RL algorithms, as well as state-of-the-art meta-learning methods specifically designed to deal with non-stationarity.
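To make the mechanism described above concrete, the sketch below illustrates one plausible way such a detector could be organized: an ensemble of probabilistic dynamics models, one per discovered context, with CUSUM-style statistics accumulated over log-likelihood ratios between the active model and each alternative (including a freshly initialized "new context" model). This is a minimal illustrative sketch under simplifying assumptions (diagonal-Gaussian dynamics models, hypothetical class and parameter names such as `GaussianDynamicsModel`, `ContextDetector`, and `threshold`), not the paper's exact estimator.

```python
import numpy as np


class GaussianDynamicsModel:
    """Diagonal-Gaussian predictor of next-state deltas, fit online via exponential
    moving averages. Illustrative stand-in for a learned probabilistic dynamics model."""

    def __init__(self, state_dim, lr=0.05):
        self.mean = np.zeros(state_dim)   # running mean of state deltas
        self.var = np.ones(state_dim)     # running variance of state deltas
        self.lr = lr

    def log_likelihood(self, state, action, next_state):
        # Log-density of the observed transition under this model's Gaussian.
        delta = next_state - state
        return -0.5 * np.sum(np.log(2 * np.pi * self.var) + (delta - self.mean) ** 2 / self.var)

    def update(self, state, action, next_state):
        # Crude online fit; a real implementation would condition on (state, action).
        delta = next_state - state
        self.mean += self.lr * (delta - self.mean)
        self.var += self.lr * ((delta - self.mean) ** 2 - self.var)
        self.var = np.maximum(self.var, 1e-6)


class ContextDetector:
    """Maintains one dynamics model per discovered context and a CUSUM statistic per
    alternative hypothesis (each other known model, plus a fresh 'new context' model)."""

    def __init__(self, state_dim, threshold=15.0):
        self.state_dim = state_dim
        self.threshold = threshold                          # trades detection delay vs. false alarms
        self.models = [GaussianDynamicsModel(state_dim)]    # learned context models
        self.candidate = GaussianDynamicsModel(state_dim)   # hypothesis: an unseen context
        self.active = 0
        self.stats = {}                                     # CUSUM statistic per alternative

    def step(self, state, action, next_state):
        """Process one transition; return True if a change-point was declared."""
        cur_ll = self.models[self.active].log_likelihood(state, action, next_state)
        alternatives = {i: m for i, m in enumerate(self.models) if i != self.active}
        alternatives["new"] = self.candidate

        changed = None
        for key, model in alternatives.items():
            ratio = model.log_likelihood(state, action, next_state) - cur_ll
            # CUSUM update: accumulate positive evidence, reset at zero otherwise.
            self.stats[key] = max(0.0, self.stats.get(key, 0.0) + ratio)
            if self.stats[key] > self.threshold:
                changed = key

        # Keep the active model and the candidate adapted to the incoming stream.
        self.models[self.active].update(state, action, next_state)
        self.candidate.update(state, action, next_state)

        if changed is not None:
            self._switch(changed)
            return True
        return False

    def _switch(self, key):
        if key == "new":
            # Promote the candidate into the ensemble; start a fresh candidate.
            self.models.append(self.candidate)
            self.active = len(self.models) - 1
            self.candidate = GaussianDynamicsModel(self.state_dim)
        else:
            self.active = key
        self.stats = {}
```

In this kind of scheme, the `threshold` parameter plays the role of the classical CUSUM threshold: raising it lowers the false-alarm rate at the cost of a longer detection delay, which is the trade-off behind claims (i) and (ii) in the abstract. The detected context index would then select which specialized policy (or newly instantiated one) is deployed.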