Building generally capable agents is a grand challenge for deep reinforcement learning (RL). To approach this challenge practically, we outline two key desiderata: 1) to facilitate generalization, exploration should be task agnostic; 2) to facilitate scalability, exploration policies should collect large quantities of data without costly centralized retraining. Combining these two properties, we introduce the reward-free deployment efficiency setting, a new paradigm for RL research. We then present CASCADE, a novel approach for self-supervised exploration in this new setting. CASCADE seeks to learn a world model by collecting data with a population of agents, using an information-theoretic objective inspired by Bayesian Active Learning. CASCADE achieves this by explicitly maximizing the diversity of trajectories sampled by the population through a novel cascading objective. We provide theoretical intuition for CASCADE, which we show in a tabular setting improves upon naïve approaches that do not account for population diversity. We then demonstrate that CASCADE collects diverse, task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari, MiniGrid, Crafter and the DM Control Suite. Code and videos are available at https://ycxuyingchen.github.io/cascade/
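To make the cascading idea concrete, here is a minimal toy sketch (not the paper's implementation) of a greedy cascade: each new member of the exploration population is chosen to maximize an information-gain proxy (ensemble disagreement over its trajectory) plus a diversity bonus relative to trajectories of members already selected. The helper names (`info_gain`, `diversity`, `cascade_select`) and the `beta` trade-off weight are illustrative assumptions, not identifiers from the paper's codebase.

```python
import numpy as np

def info_gain(traj, ensemble):
    # Disagreement across an ensemble of dynamics models, used as a
    # proxy for expected information gain about the world model.
    preds = np.stack([model(traj) for model in ensemble])
    return preds.var(axis=0).mean()

def diversity(traj, chosen_trajs):
    # Mean distance to trajectories already collected by the cascade;
    # zero for the first member (nothing to be diverse against yet).
    if not chosen_trajs:
        return 0.0
    return float(np.mean([np.linalg.norm(traj - t) for t in chosen_trajs]))

def cascade_select(candidate_trajs, ensemble, k, beta=1.0):
    """Greedily build a population of k explorers: each new member
    maximizes information gain PLUS diversity w.r.t. the members
    selected so far (the 'cascading' structure)."""
    chosen, chosen_trajs = [], []
    remaining = list(range(len(candidate_trajs)))
    for _ in range(k):
        scores = [
            info_gain(candidate_trajs[i], ensemble)
            + beta * diversity(candidate_trajs[i], chosen_trajs)
            for i in remaining
        ]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        chosen_trajs.append(candidate_trajs[best])
        remaining.remove(best)
    return chosen

# Toy usage: 10 candidate trajectories, a 3-model linear "ensemble".
rng = np.random.default_rng(0)
candidates = [rng.normal(size=5) for _ in range(10)]
ensemble = [lambda t, w=w: t * w for w in (0.9, 1.0, 1.1)]
population = cascade_select(candidates, ensemble, k=3)
```

The key design point the sketch mirrors is that scoring candidates independently (beta = 0) can pick k near-identical high-disagreement trajectories, whereas the cascading term makes later picks account for what earlier members already cover.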