The success of deep reinforcement learning (DRL) hinges on the availability of training data, which is typically obtained through a large number of environment interactions. In many real-world scenarios, gathering these data incurs costs and risks. The field of offline reinforcement learning addresses these issues by outsourcing data collection to a domain expert or a carefully monitored program and subsequently searching for a batch-constrained optimal policy. With the emergence of data markets, an alternative to constructing a dataset in-house is to purchase external data. However, while state-of-the-art offline reinforcement learning approaches have shown considerable promise, they currently rely on carefully constructed datasets that are well aligned with the intended target domains. This raises questions regarding the transferability and robustness of an offline reinforcement learning agent trained on externally acquired data. In this paper, we empirically evaluate the ability of current state-of-the-art offline reinforcement learning approaches to cope with source-target domain mismatch in two MuJoCo environments, finding that these algorithms underperform in the target domain. To address this, we propose data valuation for offline reinforcement learning (DVORL), which allows us to identify relevant and high-quality transitions, improving the performance and transferability of policies learned by offline reinforcement learning algorithms. The results show that our method outperforms offline reinforcement learning baselines on the two MuJoCo environments.