Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without further agent-environment interaction. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding-discrepancy measure. First, we train a predictor network to mimic a fixed target network's embeddings on expert state transitions. Then, the prediction error between the two networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We also provide a formal analysis that explains how RND prediction errors serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.
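To make the mechanism described above concrete, the following is a minimal sketch of RND-style reward annotation, not the authors' implementation: PyTorch is assumed, the network sizes, learning rate, and the mapping from prediction error to reward (here, the negated error) are illustrative choices, and all function names are hypothetical.

```python
# Minimal sketch of RND-based reward annotation for an offline dataset.
# Assumptions: PyTorch; state-only inputs; error-to-reward mapping is a modeling choice.
import torch
import torch.nn as nn


def make_net(in_dim: int, emb_dim: int = 64) -> nn.Module:
    # Simple MLP embedding network; architecture is a placeholder.
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))


state_dim = 17                       # placeholder dimensionality (e.g., a D4RL locomotion task)
target = make_net(state_dim)         # fixed, randomly initialized target network
predictor = make_net(state_dim)      # trained to mimic the target's embeddings
for p in target.parameters():
    p.requires_grad_(False)          # the target is never updated

optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)


def train_predictor(expert_states: torch.Tensor, epochs: int = 10) -> None:
    """Fit the predictor to the target's embeddings on expert state transitions only."""
    for _ in range(epochs):
        loss = ((predictor(expert_states) - target(expert_states)) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


@torch.no_grad()
def annotate_rewards(dataset_states: torch.Tensor) -> torch.Tensor:
    """Label each transition in the static dataset with an intrinsic reward.

    Low prediction error indicates an expert-like state; negating the error so that
    expert-like transitions receive higher reward is one simple choice, not the paper's.
    """
    error = ((predictor(dataset_states) - target(dataset_states)) ** 2).mean(dim=-1)
    return -error
```

In this sketch, `train_predictor` would be run on expert demonstration states, after which `annotate_rewards` relabels every transition in the offline dataset before standard offline RL training proceeds on the relabeled data.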