We study reinforcement learning (RL) with no-reward demonstrations, a setting in which an RL agent has access to additional data from the interaction of other agents with the same environment. However, it has no access to the rewards or goals of these agents, and their objectives and levels of expertise may vary widely. These assumptions are common in multi-agent settings, such as autonomous driving. To effectively use this data, we turn to the framework of successor features. This allows us to disentangle shared features and dynamics of the environment from agent-specific rewards and policies. We propose a multi-task inverse reinforcement learning (IRL) algorithm, called \emph{inverse temporal difference learning} (ITD), that learns shared state features, alongside per-agent successor features and preference vectors, purely from demonstrations without reward labels. We further show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi \Phi$-learning (pronounced `Sci-Fi'). We provide empirical evidence for the effectiveness of $\Psi \Phi$-learning as a method for improving RL, IRL, imitation, and few-shot transfer, and derive worst-case bounds for its performance in zero-shot transfer to new tasks.
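For context, the disentanglement mentioned above is the standard successor-features decomposition: a shared feature map factors each agent's rewards and values through an agent-specific preference vector. The sketch below uses the usual successor-features notation ($\phi$, $\psi$, $\mathbf{w}$) rather than necessarily the paper's exact symbols:
\begin{align}
r_n(s, a, s') &\approx \phi(s, a, s')^\top \mathbf{w}_n, \\
\psi^{\pi_n}(s, a) &= \mathbb{E}_{\pi_n}\!\Big[\textstyle\sum_{t \ge 0} \gamma^t \, \phi(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s,\ a_0 = a\Big], \\
Q^{\pi_n}(s, a) &= \psi^{\pi_n}(s, a)^\top \mathbf{w}_n,
\end{align}
where $\phi$ is shared across agents $n$, while the successor features $\psi^{\pi_n}$ and preference vector $\mathbf{w}_n$ are agent-specific; ITD fits these quantities from demonstrations without reward labels.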