EUCLID: Towards Efficient Unsupervised Reinforcement Learning with Multi-choice Dynamics Model

Unsupervised reinforcement learning (URL) is a promising paradigm for learning useful behaviors in a task-agnostic environment without extrinsic rewards, so that an agent can adapt quickly to various downstream tasks. Previous work has focused on model-free pre-training and has largely left transition dynamics modeling unstudied, which leaves considerable room to improve sample efficiency in downstream tasks. To this end, we propose an Efficient Unsupervised Reinforcement Learning Framework with Multi-choice Dynamics model (EUCLID), which introduces a novel model-fused paradigm that jointly pre-trains the dynamics model and the unsupervised exploration policy, thereby making better use of environment samples and improving downstream sample efficiency. However, constructing a generalizable model that captures the local dynamics under different behaviors remains challenging. We therefore introduce a multi-choice dynamics model that covers the local dynamics under different behaviors concurrently: it uses different heads to learn the state transitions under different behaviors during unsupervised pre-training and selects the most appropriate head for prediction in the downstream task. Experimental results in the manipulation and locomotion domains demonstrate that EUCLID achieves state-of-the-art performance with high sample efficiency, basically solving the state-based URLB benchmark and reaching a mean normalized score of $104.0 \pm 1.2\%$ in downstream tasks with 100k fine-tuning steps, which is equivalent to DDPG's performance at 2M interaction steps, i.e., with 20x more data.
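The multi-head idea described above can be illustrated with a toy sketch. This is not the EUCLID implementation: the linear heads, the helper names (`make_linear_head`, `mean_squared_error`, `select_head`), and the selection rule (pick the head with the lowest one-step prediction error on a few downstream transitions) are all assumptions made for illustration, since the abstract does not state how the most appropriate head is chosen.

```python
from typing import Callable, List, Sequence, Tuple

State = Tuple[float, ...]
Transition = Tuple[State, State, State]  # (state, action, next_state)
Head = Callable[[State, State], State]


def make_linear_head(gain: float) -> Head:
    """Hypothetical head: next_state = state + gain * action (element-wise).

    Each gain stands in for the local dynamics learned under one behavior.
    """
    def head(state: State, action: State) -> State:
        return tuple(s + gain * a for s, a in zip(state, action))
    return head


def mean_squared_error(head: Head, transitions: Sequence[Transition]) -> float:
    """Average squared one-step prediction error of a head."""
    total, count = 0.0, 0
    for state, action, next_state in transitions:
        predicted = head(state, action)
        total += sum((p - t) ** 2 for p, t in zip(predicted, next_state))
        count += len(next_state)
    return total / count


def select_head(heads: List[Head], transitions: Sequence[Transition]) -> int:
    """Assumed selection rule: index of the head with the lowest error."""
    errors = [mean_squared_error(h, transitions) for h in heads]
    return min(range(len(heads)), key=errors.__getitem__)


# Three "pre-trained" heads, one per behavior (toy stand-ins).
heads = [make_linear_head(g) for g in (0.1, 0.5, 1.0)]

# Toy downstream task whose true dynamics match the second head.
true_dynamics = make_linear_head(0.5)
states_actions = [((0.0, 1.0), (1.0, -1.0)), ((2.0, 0.5), (0.5, 0.5))]
downstream = [(s, a, true_dynamics(s, a)) for s, a in states_actions]

best = select_head(heads, downstream)
```

In this toy setup `best` is the index of the head whose gain matches the downstream dynamics. A real system would use learned neural heads on a shared encoder, and may rank heads by a different criterion.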
