Reinforcement Learning (RL) techniques have drawn great attention for many challenging tasks, but their performance often deteriorates dramatically when applied to real-world problems. Various methods, such as domain randomization, have been proposed to address this by training agents under diverse environmental setups so that they can generalize to different environments during deployment. However, these methods usually do not properly incorporate information about the underlying environmental factors the agents interact with, and can therefore be overly conservative when facing changes in the surroundings. In this paper, we first formalize the task of adapting to changing environmental dynamics in RL as a generalization problem using Contextual Markov Decision Processes (CMDPs). We then propose Asymmetric Actor-Critic in Contextual RL (AACC), an end-to-end actor-critic method for such generalization tasks. We experimentally demonstrate substantial performance improvements of AACC over existing baselines in a range of simulated environments.
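To make the asymmetry concrete, below is a minimal sketch, assuming a PyTorch implementation, of the kind of actor-critic split the abstract describes: the critic is given privileged access to the environmental context (e.g., dynamics parameters) during training, while the actor conditions only on observations so it can be deployed without knowing the context. All network sizes, dimensions, and names here are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of an asymmetric actor-critic in a contextual RL setting.
# Assumption: the critic receives the environment context (privileged, training-time
# only), while the actor uses observations alone. Layer sizes are illustrative.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=64):
    # Small two-hidden-layer MLP used for both networks.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class AsymmetricActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim, ctx_dim):
        super().__init__()
        # Actor: observation -> action (context is deliberately NOT an input).
        self.actor = mlp(obs_dim, act_dim)
        # Critic: observation + context -> state value (privileged input).
        self.critic = mlp(obs_dim + ctx_dim, 1)

    def act(self, obs):
        # Deployment-time policy: needs only the observation.
        return torch.tanh(self.actor(obs))

    def value(self, obs, ctx):
        # Training-time value estimate conditioned on the hidden context.
        return self.critic(torch.cat([obs, ctx], dim=-1)).squeeze(-1)


if __name__ == "__main__":
    model = AsymmetricActorCritic(obs_dim=8, act_dim=2, ctx_dim=3)
    obs = torch.randn(4, 8)             # batch of observations
    ctx = torch.randn(4, 3)             # batch of contexts (e.g., friction, mass)
    print(model.act(obs).shape)         # torch.Size([4, 2])
    print(model.value(obs, ctx).shape)  # torch.Size([4])
```

The design choice illustrated here is that the context only enters the value function, so the policy gradient benefits from a better-informed baseline during training while the deployed policy never requires access to the (typically unobservable) environmental factors.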