Unsupervised reinforcement learning aims to acquire skills without prior goal representations: an agent autonomously explores an open-ended environment to represent goals and learn a goal-conditioned policy. However, this procedure is often time-consuming, which limits its rollout in target environments where interaction is expensive. The intuitive alternative of training in another, interaction-rich environment undermines the reproducibility of the trained skills in the target environment because of dynamics shifts, and thus inhibits direct transfer. Assuming free access to a source environment, we propose an unsupervised domain adaptation method to identify and acquire skills across dynamics. In particular, we introduce a KL-regularized objective to encourage the emergence of skills, rewarding the agent both for discovering skills and for aligning its behaviors with respect to dynamics shifts. This means that both dynamics (source and target) shape the reward to facilitate the learning of adaptive skills. We also conduct experiments to demonstrate that our method can effectively learn skills that can be smoothly deployed in the target environment.
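To illustrate how a reward might combine the two signals described above, the following is a minimal sketch and not the formulation used in this work. It assumes a skill-discovery term written as the log-ratio between a skill discriminator and a skill prior, and an alignment term written as the log-ratio of target to source transition likelihoods. The function name `shaped_reward`, its arguments, and the weight `beta` are all hypothetical.

```python
import math


def shaped_reward(log_q_skill, log_p_skill, log_p_target, log_p_source, beta=1.0):
    """Hypothetical shaped reward (illustrative assumption, not the paper's formula).

    log_q_skill:  log-probability a discriminator assigns to the active skill
    log_p_skill:  log-probability of that skill under the skill prior
    log_p_target: log-likelihood of the observed transition under target dynamics
    log_p_source: log-likelihood of the same transition under source dynamics
    beta:         assumed weight on the dynamics-alignment term
    """
    # Positive when the skill is easier to identify than the prior suggests.
    discovery = log_q_skill - log_p_skill
    # Negative when the transition is likelier in the source than in the target.
    alignment = log_p_target - log_p_source
    return discovery + beta * alignment


# Four equally likely skills, a discriminator that is 50% confident, and a
# transition that is common in the source (0.8) but rare in the target (0.2).
reward = shaped_reward(
    log_q_skill=math.log(0.5),
    log_p_skill=math.log(0.25),
    log_p_target=math.log(0.2),
    log_p_source=math.log(0.8),
)
```

In this sketch the alignment term outweighs the discovery bonus, so the agent is penalized overall for relying on a transition that the target dynamics rarely produce.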