This paper studies the problem of learning a control policy without interacting with the environment, instead learning purely from an existing dataset. Prior work has demonstrated that offline learning algorithms (e.g., behavioural cloning and offline reinforcement learning) are more likely to discover a satisfactory policy when trained on high-quality expert data. However, many practical real-world datasets contain significant proportions of examples generated by low-skilled agents. We therefore propose the behaviour discriminator (BD), a novel and simple data-filtering approach based on semi-supervised learning that can accurately discern expert data within a mixed-quality dataset. We used the BD to pre-process the mixed-skill-level datasets from the Real Robot Challenge (RRC) III, an open competition requiring participants to solve several dexterous robotic manipulation tasks using offline learning methods; with BD pre-processing, a standard behavioural cloning algorithm outperformed other, more sophisticated offline learning algorithms. Moreover, we demonstrate that BD pre-processing can be applied to a number of D4RL benchmark problems, improving the performance of multiple state-of-the-art offline reinforcement learning algorithms.
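To make the filtering idea concrete, here is a minimal sketch of a semi-supervised data filter in the spirit of the BD, using scikit-learn's self-training classifier as the semi-supervised learner. The function name, the state-action feature construction, and the keep threshold are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a semi-supervised filter that keeps
# expert-like transitions from a mixed-quality offline dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

def filter_expert_transitions(states, actions, labels, keep_threshold=0.5):
    """Keep transitions the discriminator scores as expert-like.

    states  : (N, state_dim) array of observations
    actions : (N, action_dim) array of actions
    labels  : (N,) int array; 1 = known expert, 0 = known low-skill,
              -1 = unlabeled (scikit-learn's convention for unlabeled data)
    """
    X = np.concatenate([states, actions], axis=1)
    # Self-training: the base classifier pseudo-labels its most confident
    # unlabeled examples and retrains, a simple semi-supervised scheme.
    clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X, labels)
    expert_prob = clf.predict_proba(X)[:, 1]
    mask = expert_prob >= keep_threshold
    return states[mask], actions[mask]
```

The filtered (state, action) pairs could then be fed to a standard behavioural cloning learner, matching the pipeline the abstract describes.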