Offline reinforcement learning (RL) enables agents to learn optimal policies from pre-collected datasets. However, datasets containing suboptimal and fragmented trajectories hinder reward propagation, resulting in inaccurate value estimation and degraded policy performance. While trajectory stitching via generative models offers a promising remedy, existing augmentation methods frequently produce trajectories that are either confined to the support of the behavior policy or violate the underlying dynamics, limiting their usefulness for policy improvement. We propose ASTRO, a data augmentation framework that generates distributionally novel and dynamics-consistent trajectories for offline RL. ASTRO first learns a temporal-distance representation to identify stitch targets that are both distinct and reachable. It then employs a dynamics-guided stitch planner that adaptively generates connecting action sequences using Rollout Deviation Feedback, defined as the gap between the target state sequence and the state sequence actually reached by executing the predicted actions, thereby improving the feasibility and reachability of trajectory stitching. This yields effective augmentation through stitching and ultimately enhances policy learning. ASTRO outperforms prior offline RL augmentation methods across a range of algorithms, achieving notable performance gains on the challenging OGBench suite and consistent improvements on standard offline RL benchmarks such as D4RL.
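As a rough illustration (our notation, not taken from the paper), the Rollout Deviation Feedback described above can be read as a distance between the planned target states and the states actually reached when the predicted actions are rolled out through the dynamics:
$$
\mathcal{D}_{\mathrm{RDF}} \;=\; \sum_{t=1}^{H} \bigl\lVert \hat{s}_{t} - s^{\mathrm{tgt}}_{t} \bigr\rVert_{2},
\qquad
\hat{s}_{t} \;=\; f\bigl(\hat{s}_{t-1},\, a_{t-1}\bigr),
$$
where $s^{\mathrm{tgt}}_{1:H}$ is the target state sequence, $a_{0:H-1}$ are the predicted connecting actions, $f$ denotes the (learned) dynamics model, and $H$ is the stitching horizon. Under this reading, a large deviation signals an infeasible or unreachable stitch, which the planner can use to adaptively revise the connecting action sequence; the exact form of the feedback used by ASTRO is specified in the method section.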