Reinforcement learning (RL) agents are widely used for solving complex sequential decision-making tasks, but still exhibit difficulty generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function, e.g. via self-supervised learning (SSL), can lead to better generalization in RL agents, they struggle in the offline RL setting, i.e. learning from a static dataset. We show that the performance of online algorithms for generalization in RL can be hindered in the offline setting due to poor estimation of similarity between observations. We propose a new theoretically motivated framework called Generalized Similarity Functions (GSF), which uses contrastive learning to train an offline RL agent to aggregate observations based on the similarity of their expected future behavior, where we quantify this similarity using \emph{generalized value functions}. We show that GSF is general enough to recover existing SSL objectives while also improving zero-shot generalization performance on a complex offline RL benchmark, offline Procgen.
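As an illustrative sketch only (the specific form below and the symbols $\Psi$, $w$, $\phi$, $\tau$, $\mathcal{B}$ are expository assumptions, not the exact training loss), such a similarity-weighted contrastive objective can be written as an InfoNCE term weighted by agreement between generalized value functions:
\[
w(o, o') = \exp\!\big(-\lVert \Psi(o) - \Psi(o') \rVert_2\big), \qquad
\mathcal{L}(\phi) = -\,\mathbb{E}_{(o, o^+) \sim \mathcal{D}}\!\left[ w(o, o^+)\, \log \frac{\exp\!\big(\phi(o)^\top \phi(o^+)/\tau\big)}{\sum_{o' \in \mathcal{B}} \exp\!\big(\phi(o)^\top \phi(o')/\tau\big)} \right],
\]
where $\Psi(o) = \big(V^{\pi}_{c_1}(o), \dots, V^{\pi}_{c_m}(o)\big)$ stacks generalized value functions for cumulants $c_1, \dots, c_m$, $\phi$ is the observation encoder, $\tau$ a temperature, and $\mathcal{B}$ a minibatch drawn from the offline dataset $\mathcal{D}$.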