The objective of offline RL is to learn optimal policies when a fixed dataset of exploratory demonstrations is available and sampling additional observations is impossible (typically because doing so is either costly or raises ethical concerns). Off-the-shelf approaches to this problem require a properly defined cost function (or its evaluation on the provided dataset), which is seldom available in practice. To circumvent this issue, a reasonable alternative is to query an expert for a few optimal demonstrations in addition to the exploratory dataset. The objective is then to learn a policy that is optimal with respect to the expert's latent cost function. Current solutions either solve a behaviour cloning problem (which does not leverage the exploratory data) or a reinforced imitation learning problem (using a fixed cost function that discriminates the available exploratory trajectories from the expert ones). Inspired by the success of IRL techniques in achieving state-of-the-art imitation performance in online settings, we exploit GAN-based data augmentation procedures to construct the first offline IRL algorithm. The resulting policies outperform the aforementioned solutions on multiple OpenAI Gym environments.
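To make the notion of a discriminating cost function concrete, the sketch below shows a minimal GAIL-style discriminator that separates expert (state, action) pairs from exploratory ones and exposes its log-probability as a learned reward signal. This is only an illustration of the general technique referenced in the abstract, not the paper's algorithm; the network sizes, tensor dimensions, and the `surrogate_reward` helper are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Scores how expert-like a (state, action) pair is (hypothetical architecture)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # Logit that the pair was generated by the expert.
        return self.net(torch.cat([states, actions], dim=-1))


def surrogate_reward(disc: Discriminator, states, actions) -> torch.Tensor:
    # GAN-style learned reward: high when the discriminator deems the pair expert-like.
    with torch.no_grad():
        return torch.nn.functional.logsigmoid(disc(states, actions))


# Placeholder batches standing in for the expert demonstrations and the
# exploratory dataset (dimensions are arbitrary for this sketch).
state_dim, action_dim = 11, 3
disc = Discriminator(state_dim, action_dim)
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

expert_s, expert_a = torch.randn(32, state_dim), torch.randn(32, action_dim)
explo_s, explo_a = torch.randn(32, state_dim), torch.randn(32, action_dim)

# One discriminator update: label expert pairs 1, exploratory pairs 0.
logits = torch.cat([disc(expert_s, expert_a), disc(explo_s, explo_a)])
labels = torch.cat([torch.ones(32, 1), torch.zeros(32, 1)])
loss = bce(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()

# The learned rewards could then feed an offline policy-optimisation step.
rewards = surrogate_reward(disc, explo_s, explo_a)
```

The design choice illustrated here is the one the abstract contrasts against: a fixed cost function obtained by discriminating exploratory from expert trajectories, which the proposed GAN-based data augmentation is meant to improve upon.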