Temporal information is essential to learning effective policies with Reinforcement Learning (RL). However, current state-of-the-art RL algorithms either assume that such information is given as part of the state space or, when learning from pixels, use the simple heuristic of frame-stacking to implicitly capture temporal information present in the image observations. This heuristic contrasts with the current paradigm in video classification architectures, which utilize explicit encodings of temporal information through methods such as optical flow and two-stream architectures to achieve state-of-the-art performance. Inspired by leading video classification architectures, we introduce the Flow of Latents for Reinforcement Learning (Flare), a network architecture for RL that explicitly encodes temporal information through latent vector differences. We show that Flare (i) recovers optimal performance in state-based RL without explicit access to the state velocity, solely with positional state information, (ii) achieves state-of-the-art performance on challenging pixel-based continuous control tasks within the DeepMind control benchmark suite, namely quadruped walk, hopper hop, finger turn hard, pendulum swingup, and walker run, and is the most sample-efficient model-free pixel-based RL algorithm, outperforming the prior model-free state-of-the-art by 1.9X and 1.5X on the 500k and 1M step benchmarks, respectively, and (iii), when augmented over Rainbow DQN, outperforms this state-of-the-art baseline on 5 of 8 challenging Atari games at the 100M time step benchmark.
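A minimal sketch of the latent-difference idea described above, written in PyTorch: each frame is encoded independently, differences between consecutive latent vectors serve as an explicit temporal ("flow") signal, and latents and their differences are fused before the downstream policy or value heads. The encoder layout, latent dimension, and fusion by concatenation are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch only: encoder shape, latent_dim, and concatenation fusion
# are assumptions for the example, not the paper's exact architecture.
import torch
import torch.nn as nn


class LatentFlowEncoder(nn.Module):
    """Encodes a stack of frames and augments per-frame latents
    with their temporal differences (the 'flow of latents')."""

    def __init__(self, latent_dim: int = 50):
        super().__init__()
        # Per-frame convolutional encoder, shared across time steps.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(latent_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        z = self.fc(self.conv(frames.flatten(0, 1))).view(b, t, -1)  # (b, t, d)
        flow = z[:, 1:] - z[:, :-1]  # differences between consecutive latents
        # Fuse latents with their differences before the policy/value heads.
        return torch.cat([z[:, 1:], flow], dim=-1).flatten(1)


encoder = LatentFlowEncoder()
obs = torch.randn(4, 3, 3, 64, 64)  # batch of 4, stack of 3 RGB frames
features = encoder(obs)
print(features.shape)  # (4, (3 - 1) * 2 * 50) = (4, 200)
```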