Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers

From arXiv; published at ICLR 2021. Code (https://github.com/google-research/google-research/tree/master/darc) and blog post (https://blog.ml.cmu.edu/2020/07/31/maintaining-the-illusion-of-reality-transfer-in-rl-by-keeping-agents-in-the-darc) are available.

We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we formally show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the modified reward function penalizes the agent for visiting states and taking actions in the source domain which are not possible in the target domain. Said another way, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.

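To make the reward modification described in the abstract concrete, here is a minimal sketch. It assumes two binary domain classifiers have already been trained: one scoring full transitions (s, a, s') and one scoring state-action pairs (s, a), each returning the probability that its input came from the target domain. Under that assumption, the correction can be written as the difference of the two classifiers' log-odds, which by Bayes' rule estimates log p_target(s' | s, a) - log p_source(s' | s, a). The function and argument names are hypothetical and are not taken from the released code.

```python
import numpy as np


def delta_reward(p_target_sas, p_target_sa, eps=1e-8):
    """Sketch of a classifier-based reward correction (names are hypothetical).

    p_target_sas: classifier probability that (s, a, s') is a target-domain transition.
    p_target_sa:  classifier probability that (s, a) is a target-domain state-action pair.
    Returns the correction to add to the source-domain reward.
    """
    # Clip to avoid log(0) when a classifier is fully confident.
    p_sas = np.clip(np.asarray(p_target_sas, dtype=float), eps, 1.0 - eps)
    p_sa = np.clip(np.asarray(p_target_sa, dtype=float), eps, 1.0 - eps)

    # log p(target | s, a, s') - log p(source | s, a, s')
    log_odds_sas = np.log(p_sas) - np.log1p(-p_sas)
    # log p(target | s, a) - log p(source | s, a)
    log_odds_sa = np.log(p_sa) - np.log1p(-p_sa)

    # Negative when the next state s' makes the transition look like the source domain.
    return log_odds_sas - log_odds_sa


def modified_reward(reward, p_target_sas, p_target_sa):
    """Source-domain reward plus the correction term."""
    return reward + delta_reward(p_target_sas, p_target_sa)
```

When both classifiers are uncertain (probability 0.5), the correction is zero and the reward is unchanged. When the transition classifier is confident that the transition came from the source domain, the correction is negative, which matches the intuition that the agent is penalized for transitions that reveal it is interacting with the source domain.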