Passive observational data, such as human videos, is abundant and rich in information, yet remains largely untapped by current RL methods. Perhaps surprisingly, we show that passive data, despite lacking reward or action labels, can still be used to learn features that accelerate downstream RL. Our approach learns from passive data by modeling intentions: measuring how the likelihood of future outcomes changes when the agent acts to achieve a particular task. We propose a temporal difference learning objective for learning about intentions, resulting in an algorithm similar to conventional RL but trained entirely from passive data. When optimizing this objective, our agent simultaneously learns representations of states, of policies, and of possible outcomes in an environment, all from raw observational data. Both theoretically and empirically, this scheme learns features amenable to value prediction for downstream tasks, and our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
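To make the intention-modeling objective concrete, the following is a minimal sketch of what such a temporal-difference objective could look like. The abstract does not specify the exact form, so all symbols below are illustrative assumptions rather than the paper's definitions: $V_\theta(s, e, z)$ denotes a value function estimating the discounted likelihood of reaching outcome $e$ from state $s$ when acting under intention $z$, $\bar{V}$ is a target network, and $(s, s')$ are consecutive observations drawn from a passive dataset $\mathcal{D}$:
\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(s, s') \sim \mathcal{D}}\Big[\Big(V_\theta(s, e, z) - \big[(1-\gamma)\,\mathbb{1}[s' = e] + \gamma\,\bar{V}(s', e, z)\big]\Big)^{2}\Big].
\]
Minimizing this loss bootstraps outcome-likelihood estimates across observed transitions, mirroring conventional TD learning while requiring neither rewards nor action labels; for continuous observations, a learned density model would replace the indicator $\mathbb{1}[s' = e]$.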