Deep Reinforcement Learning (RL) has been successful in solving many complex Markov Decision Process (MDP) problems. However, agents often face unanticipated environmental changes after deployment in the real world. These changes are often spurious and unrelated to the underlying task, such as background shifts for agents with visual input. Unfortunately, deep RL policies are usually sensitive to these changes and fail to act robustly against them. This resembles the problem of domain generalization in supervised learning. In this work, we study this problem for goal-conditioned RL agents. We propose a theoretical framework in the Block MDP setting that characterizes the generalizability of goal-conditioned policies to new environments. Under this framework, we develop a practical method, PA-SkewFit, that enhances domain generalization. Empirical evaluation shows that our goal-conditioned RL agent performs well in various unseen test environments, improving by 50% over baselines.