We consider a context-dependent Reinforcement Learning (RL) setting, which is characterized by: a) an unknown finite number of contexts that are not directly observable; b) abrupt (discontinuous) context changes occurring during an episode; and c) Markovian context evolution. We argue that this challenging case is often met in applications, and we tackle it using a Bayesian approach and variational inference. We adapt a sticky Hierarchical Dirichlet Process (HDP) prior for model learning, which is arguably best suited for Markov process modeling. We then derive a context distillation procedure, which identifies and removes spurious contexts in an unsupervised fashion. We argue that the combination of these two components allows us to infer the number of contexts from data, thus addressing the context cardinality assumption. We then derive a representation of the optimal policy, which enables efficient policy learning with off-the-shelf RL algorithms. Finally, we demonstrate empirically (using gym environments cart-pole swing-up, drone, intersection) that our approach succeeds where state-of-the-art methods developed for other frameworks fail, and we elaborate on the reasons for such failures.
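To make the setting concrete, the following minimal sketch (not the paper's implementation; the parameters alpha and kappa and all function names are illustrative assumptions) simulates Markovian context evolution with abrupt switches under a self-transition-biased Dirichlet transition prior, a finite analogue of the "sticky" behaviour that the sticky HDP prior encodes for an unbounded number of contexts.

```python
import numpy as np

def sample_sticky_transition_matrix(n_contexts, alpha=1.0, kappa=10.0, rng=None):
    """Sample a context-transition matrix whose rows are Dirichlet-distributed
    with extra mass (kappa) on the diagonal, so the chain tends to stay in its
    current context ("stickiness"). Illustrative finite-dimensional analogue only."""
    rng = np.random.default_rng(rng)
    P = np.empty((n_contexts, n_contexts))
    for k in range(n_contexts):
        concentration = np.full(n_contexts, alpha)
        concentration[k] += kappa  # bias toward self-transitions
        P[k] = rng.dirichlet(concentration)
    return P

def simulate_context_sequence(P, horizon, rng=None):
    """Roll out the latent Markov chain of contexts; switches are abrupt."""
    rng = np.random.default_rng(rng)
    contexts = [0]
    for _ in range(horizon - 1):
        contexts.append(rng.choice(len(P), p=P[contexts[-1]]))
    return np.array(contexts)

if __name__ == "__main__":
    P = sample_sticky_transition_matrix(n_contexts=3, rng=0)
    print(simulate_context_sequence(P, horizon=50, rng=0))  # long runs of one context, occasional jumps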