Behavioural cloning (BC) is a widely used imitation learning method for inferring a sequential decision-making policy from expert demonstrations. However, when the quality of the data is sub-optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes to the BC algorithm. The proposed data-improvement approach, Trajectory Stitching (TS), generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating a new action to connect them. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment and to improve a state-value function. We demonstrate that iteratively replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that TS yields significant performance gains over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC: model-based offline planning (MBOP) and policy constraint (TD3+BC).
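To make the idea concrete, below is a minimal Python sketch of a single stitching pass, not the paper's implementation: the names `stitch_trajectories`, `plausibility`, `inverse_dynamics` and `value_fn` are hypothetical stand-ins for the learned probabilistic dynamics model, the action generator and the state-value function described above. A transition is rewired only when a candidate next state is both plausible under the dynamics model and improves the value estimate; in the full method this pass would be iterated, and BC would then be run on the stitched dataset.

```python
import numpy as np

def stitch_trajectories(trajectories, plausibility, inverse_dynamics,
                        value_fn, threshold=0.5):
    """Sketch of one Trajectory Stitching pass (hypothetical components).

    trajectories:     list of state sequences, each a (T_i, d) array.
    plausibility:     plausibility(s, s_next) -> float, the likelihood of the
                      jump s -> s_next under a learned dynamics model p(s'|s).
    inverse_dynamics: inverse_dynamics(s, s_next) -> connecting action.
    value_fn:         value_fn(s) -> float, a state-value estimate V(s).
    Returns a list of (states, actions) pairs in which each original
    transition is replaced by a plausible, higher-value stitched one
    whenever such a candidate exists in the data.
    """
    trajs = [np.asarray(t, dtype=float) for t in trajectories]
    pool = np.concatenate(trajs, axis=0)  # candidate next states, pooled across trajectories
    stitched = []
    for traj in trajs:
        states, actions = [traj[0]], []
        s = traj[0]
        for s_next in traj[1:]:
            best = s_next
            for cand in pool:
                if np.allclose(cand, s):
                    continue  # skip degenerate self-transitions
                # Accept a stitch only if it is plausible under the dynamics
                # model AND improves on the current best next state's value.
                if plausibility(s, cand) >= threshold and value_fn(cand) > value_fn(best):
                    best = cand
            actions.append(inverse_dynamics(s, best))  # generate the new connecting action
            states.append(best)
            s = best
        stitched.append((np.stack(states), np.stack(actions)))
    return stitched

# Toy usage with stand-in models (illustrative placeholders only):
rng = np.random.default_rng(0)
data = [rng.normal(size=(10, 2)).cumsum(axis=0) for _ in range(4)]
plaus = lambda s, c: float(np.exp(-np.sum((c - s) ** 2)))  # Gaussian surrogate for p(s'|s)
inv_dyn = lambda s, c: c - s                               # action = state displacement
value = lambda s: float(s[0])                              # value increases along dim 0
new_trajectories = stitch_trajectories(data, plaus, inv_dyn, value, threshold=0.2)
```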