In the real world, affecting the environment through a weak policy can be expensive or risky, which hampers real-world applications of reinforcement learning. Offline Reinforcement Learning (RL) can learn policies from a given dataset without interacting with the environment. However, the dataset is the only source of information for an Offline RL algorithm and thus determines the performance of the learned policy. We still lack studies on how dataset characteristics influence different Offline RL algorithms. Therefore, we conducted a comprehensive empirical analysis of how dataset characteristics affect the performance of Offline RL algorithms for discrete action environments. A dataset is characterized by two metrics: (1) the average dataset return, measured by the Trajectory Quality (TQ), and (2) the coverage, measured by the State-Action Coverage (SACo). We found that variants of the off-policy Deep Q-Network family require datasets with high SACo to perform well. Algorithms that constrain the learned policy towards the given dataset perform well on datasets with high TQ or SACo. For datasets with high TQ, Behavior Cloning outperforms or performs on par with the best Offline RL algorithms.
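The abstract does not spell out how TQ and SACo are computed. The sketch below is a minimal illustration of one plausible reading: TQ as the average dataset return normalized between a random-policy and an expert-policy baseline, and SACo as the number of unique state-action pairs relative to a reference dataset. The function names, signatures, and normalizations are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def trajectory_quality(dataset_returns, random_return, expert_return):
    """Hypothetical TQ: average per-trajectory return, normalized between
    a random-policy and an expert-policy baseline (assumption, not the
    paper's exact formula)."""
    mean_return = np.mean(dataset_returns)
    return (mean_return - random_return) / (expert_return - random_return)

def state_action_coverage(state_action_pairs, reference_pairs):
    """Hypothetical SACo: unique (state, action) pairs in the dataset
    relative to the unique pairs of a reference dataset (assumption,
    not the paper's exact formula)."""
    unique_dataset = {tuple(sa) for sa in state_action_pairs}
    unique_reference = {tuple(sa) for sa in reference_pairs}
    return len(unique_dataset) / len(unique_reference)

# Toy example with discrete states and actions.
returns = [12.0, 8.5, 10.2]                       # per-trajectory returns
pairs = [(0, 1), (0, 1), (3, 2), (5, 0)]          # (state, action) tuples
ref_pairs = [(0, 1), (3, 2), (5, 0), (7, 1), (9, 3)]
print(trajectory_quality(returns, random_return=2.0, expert_return=20.0))
print(state_action_coverage(pairs, ref_pairs))
```

Under these assumptions, a dataset collected by a near-expert policy would score high on TQ but may score low on SACo, since such a policy revisits a narrow set of state-action pairs.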