One of the key challenges in deploying RL to real-world applications is adapting to variations in unknown environment contexts, such as changing terrains in robotic tasks and fluctuating bandwidth in congestion control. Existing work on adaptation to unknown environment contexts either assumes the context stays the same for the whole episode or assumes the context variables are Markovian. However, in many real-world applications, the environment context usually remains stable for a random period and then changes abruptly and unpredictably within an episode, resulting in a segment structure that existing work fails to address. To exploit this segment structure of piecewise-stable context, we propose a \textit{\textbf{Se}gmented \textbf{C}ontext \textbf{B}elief \textbf{A}ugmented \textbf{D}eep~(SeCBAD)} RL method. Our method jointly infers the belief distribution over the latent context and the posterior over segment length, which allows more accurate context belief inference from the data observed within the current context segment. The inferred context belief is then used to augment the state, leading to a policy that can adapt to abrupt variations in context. We demonstrate empirically that SeCBAD infers context segment length accurately and outperforms existing methods on a toy grid world environment and MuJoCo tasks with piecewise-stable context.
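To make the idea of jointly tracking a segment-length posterior and a context belief more concrete, the following is a minimal sketch and not the SeCBAD implementation. It assumes a scalar latent context with a Gaussian prior, scalar Gaussian observations with known variance, and a constant per-step probability of a context change (`hazard`). All names here (`SegmentedBeliefFilter`, `SegmentHypothesis`, `augment_state`) are hypothetical and chosen for illustration; the recursion is a standard run-length filter in the style of Bayesian online changepoint detection, swapped in as a simple stand-in for the learned models described above.

```python
import math
from dataclasses import dataclass


@dataclass
class SegmentHypothesis:
    length: int       # number of observations assigned to the current segment
    mean: float       # posterior mean of the latent context under this hypothesis
    precision: float  # posterior precision of the latent context
    weight: float     # posterior probability of this segment length


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


class SegmentedBeliefFilter:
    """Toy joint filter over segment length and a scalar Gaussian context."""

    def __init__(self, hazard=0.05, prior_mean=0.0, prior_var=25.0,
                 obs_var=0.25, prune_below=1e-12):
        self.hazard = hazard            # assumed constant change probability per step
        self.prior_mean = prior_mean    # assumed prior over the context
        self.prior_var = prior_var
        self.obs_var = obs_var          # assumed known observation noise
        self.prune_below = prune_below  # drop hypotheses with negligible weight
        self.hypotheses = []

    def _condition(self, mean, precision, x):
        # Conjugate Gaussian update of the context belief with one observation.
        obs_precision = 1.0 / self.obs_var
        new_precision = precision + obs_precision
        new_mean = (precision * mean + obs_precision * x) / new_precision
        return new_mean, new_precision

    def update(self, x):
        # Case 1: x starts a new segment, so it is predicted from the prior.
        p_change = self.hazard if self.hypotheses else 1.0
        mean, precision = self._condition(self.prior_mean, 1.0 / self.prior_var, x)
        weight = p_change * gaussian_pdf(x, self.prior_mean,
                                         self.obs_var + self.prior_var)
        updated = [SegmentHypothesis(1, mean, precision, weight)]
        # Case 2: x extends an existing segment, so it is predicted from the
        # belief built only on data within that segment.
        for h in self.hypotheses:
            likelihood = gaussian_pdf(x, h.mean, self.obs_var + 1.0 / h.precision)
            mean, precision = self._condition(h.mean, h.precision, x)
            updated.append(SegmentHypothesis(
                h.length + 1, mean, precision,
                h.weight * (1.0 - self.hazard) * likelihood))
        # Normalize, prune negligible hypotheses, and renormalize.
        total = sum(h.weight for h in updated)
        kept = [h for h in updated if h.weight / total > self.prune_below]
        total = sum(h.weight for h in kept)
        for h in kept:
            h.weight /= total
        self.hypotheses = kept

    def map_segment_length(self):
        return max(self.hypotheses, key=lambda h: h.weight).length

    def belief_mean(self):
        # Context belief mean, marginalized over the segment-length posterior.
        return sum(h.weight * h.mean for h in self.hypotheses)


def augment_state(state, belief_filter):
    # Append the context belief to the raw state before passing it to a policy.
    return list(state) + [belief_filter.belief_mean()]
```

Calling `update` once per time step and then passing `augment_state(state, belief_filter)` to the policy mirrors, in a very reduced form, the belief-augmented state described above. In this sketch the belief is summarized by its mean only; a fuller version would also pass uncertainty or the segment-length posterior.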