Reinforcement learning methods trained on few environments rarely learn policies that generalize to unseen environments. To improve generalization, we incorporate the inherent sequential structure in reinforcement learning into the representation learning process. This approach is orthogonal to recent approaches, which rarely exploit this structure explicitly. Specifically, we introduce a theoretically motivated policy similarity metric (PSM) for measuring behavioral similarity between states. PSM assigns high similarity to states for which the optimal policies in those states as well as in future states are similar. We also present a contrastive representation learning procedure to embed any state similarity metric, which we instantiate with PSM to obtain policy similarity embeddings (PSEs). We demonstrate that PSEs improve generalization on diverse benchmarks, including LQR with spurious correlations, a jumping task from pixels, and Distracting DM Control Suite.
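Concretely, the recursion suggested by this description can be sketched as follows (the notation here is our own shorthand rather than quoted from the paper): for states $x$ and $y$, an optimal policy $\pi^*$, a probability pseudometric $\mathrm{DIST}$ between action distributions, and the 1-Wasserstein distance $\mathcal{W}_1(d)$ under a metric $d$,
\[
d(x, y) = \mathrm{DIST}\big(\pi^*(\cdot \mid x), \pi^*(\cdot \mid y)\big) + \gamma\, \mathcal{W}_1(d)\big(P^{\pi^*}(\cdot \mid x), P^{\pi^*}(\cdot \mid y)\big),
\]
where the first term compares optimal behavior at $x$ and $y$ themselves and the recursive second term compares it at the states likely to follow.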
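The contrastive procedure for embedding such a metric can be illustrated with a minimal sketch (a simplified illustration under our own assumptions, not the paper's exact objective; pse_contrastive_loss, gamma_sim, and beta are hypothetical names): convert the metric into similarities, e.g. Gamma(x, y) = exp(-d(x, y) / beta), then train an encoder so that each state is pulled toward its most behaviorally similar counterpart in a paired environment and pushed away from the rest.

import torch
import torch.nn.functional as F

def pse_contrastive_loss(z_x, z_y, gamma_sim, temperature=0.1):
    # z_x: (N, D) embeddings of states from the first environment
    # z_y: (M, D) embeddings of states from the second environment
    # gamma_sim: (N, M) similarities derived from a state metric,
    #            e.g. exp(-d_PSM(x, y) / beta), assumed precomputed
    z_x = F.normalize(z_x, dim=1)
    z_y = F.normalize(z_y, dim=1)
    logits = z_x @ z_y.t() / temperature       # cosine similarities, (N, M)
    positives = gamma_sim.argmax(dim=1)        # most behaviorally similar state per anchor
    log_probs = F.log_softmax(logits, dim=1)   # the remaining states act as negatives
    return -log_probs[torch.arange(z_x.size(0)), positives].mean()

# Toy usage with random tensors, just to show the expected shapes.
z_x, z_y = torch.randn(8, 64), torch.randn(8, 64)
gamma_sim = torch.rand(8, 8)
loss = pse_contrastive_loss(z_x, z_y, gamma_sim)

Choosing positive pairs from the metric rather than from data augmentation is what allows any state similarity metric, including PSM, to be plugged into the representation learning step.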