In many real-world applications, collecting large and high-quality datasets may be too costly or impractical. Offline reinforcement learning (RL) aims to infer an optimal decision-making policy from a fixed set of data. Extracting as much information as possible from the historical data is therefore vital for good performance once the policy is deployed. We propose a model-based data augmentation strategy, Trajectory Stitching (TS), to improve the quality of sub-optimal historical trajectories. TS introduces unseen actions joining previously disconnected states: using a probabilistic notion of state reachability, it effectively `stitches' together parts of the historical demonstrations to generate new, higher-quality ones. A stitching event consists of a transition between a pair of observed states through a synthetic and highly probable action. New actions are introduced only when they are expected to be beneficial, according to an estimated state-value function. We show that using this data augmentation strategy jointly with behavioural cloning (BC) yields a policy that outperforms the one behaviour-cloned from the original dataset. The improved policy could then serve as a launchpad for online RL through planning and demonstration-guided RL.
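To make the acceptance rule for a single stitching event concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes three learned components with hypothetical names and signatures: `reachability(s, s')`, a probabilistic model scoring one-step reachability between observed states; `inverse_dynamics(s, s')`, which proposes the synthetic action joining them; and `value_fn(s)`, the estimated state-value function used to decide whether a stitch is beneficial.

```python
def propose_stitch(s_from, s_next_original, candidate_states,
                   reachability, inverse_dynamics, value_fn,
                   reach_threshold=0.95):
    """Sketch of one stitching event: return (new_next_state, synthetic_action)
    if some candidate state from another trajectory is both highly reachable
    from s_from and higher-valued than the original successor, else None."""
    best_state = None
    best_value = value_fn(s_next_original)

    for s_to in candidate_states:
        # Keep only candidates the model deems highly reachable in one step.
        if reachability(s_from, s_to) < reach_threshold:
            continue
        # Accept only if the estimated state value improves on the successor
        # observed in the original trajectory.
        v = value_fn(s_to)
        if v > best_value:
            best_state, best_value = s_to, v

    if best_state is None:
        return None

    # Synthesise the unseen action that joins the two observed states.
    a_synthetic = inverse_dynamics(s_from, best_state)
    return best_state, a_synthetic
```

Applying this rule along every trajectory in the dataset, and behaviour-cloning the resulting stitched trajectories, is the overall pattern the abstract describes; thresholds and model choices here are placeholders.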