Many practical applications of reinforcement learning (RL) constrain the agent to learn from a fixed offline dataset of logged interactions that has already been gathered, with no possibility of further data collection. However, commonly used off-policy RL algorithms, such as the Deep Q-Network (DQN) and the Deep Deterministic Policy Gradient (DDPG), cannot learn without data correlated with the distribution under the current policy, making them ineffective in this offline setting. As a first step towards useful offline RL algorithms, we analyze the cause of instability in standard off-policy RL algorithms: the bootstrapping error. The key to avoiding this error is ensuring that the agent's selected actions do not leave the support of the fixed offline dataset. Based on this analysis, we propose a novel offline RL framework, the Least Restriction (LR). LR treats selecting an action as drawing a sample from a probability distribution. It imposes only a mild constraint on action selection, which not only keeps actions within the offline dataset but also removes the unnecessary restrictions of earlier approaches (e.g., Batch-Constrained Deep Q-Learning). Furthermore, we demonstrate that LR is able to learn robustly from different offline datasets, including random and suboptimal demonstrations, on a range of practical control tasks.
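To make the bootstrapping-error argument concrete, the following is a minimal tabular sketch (not the paper's LR implementation; all function names such as support_constrained_backup are illustrative) contrasting a standard Q-learning backup, whose max ranges over actions never seen in the offline dataset, with a backup restricted to in-dataset actions, under the assumption of a small discrete MDP logged by a narrow behavior policy.

import numpy as np
from collections import defaultdict

GAMMA = 0.99

def q_learning_backup(Q, dataset, iters=200, lr=0.5):
    """Standard off-policy backup: the max ranges over ALL actions, including
    actions the dataset never shows at the next state. Their Q-values are never
    corrected by data, so any overestimation propagates through bootstrapping."""
    for _ in range(iters):
        for (s, a, r, s_next, done) in dataset:
            target = r + (0.0 if done else GAMMA * Q[s_next].max())
            Q[s, a] += lr * (target - Q[s, a])
    return Q

def support_constrained_backup(Q, dataset, iters=200, lr=0.5):
    """Backup restricted to in-dataset actions: the max is taken only over
    actions observed at the next state, so the bootstrap never queries
    out-of-distribution (state, action) pairs."""
    seen = defaultdict(set)
    for (s, a, _, _, _) in dataset:
        seen[s].add(a)
    for _ in range(iters):
        for (s, a, r, s_next, done) in dataset:
            if done or not seen[s_next]:
                target = r
            else:
                target = r + GAMMA * max(Q[s_next, b] for b in seen[s_next])
            Q[s, a] += lr * (target - Q[s, a])
    return Q

if __name__ == "__main__":
    # Toy 2-state, 3-action MDP logged by a narrow behavior policy.
    dataset = [(0, 0, 1.0, 1, False), (1, 1, 0.0, 1, True)]
    Q0 = np.random.randn(2, 3) * 10           # deliberately poor initialization
    q_plain = q_learning_backup(Q0.copy(), dataset)
    q_constr = support_constrained_backup(Q0.copy(), dataset)
    print("unconstrained:", q_plain[0, 0])    # inflated by unseen actions' Q-values
    print("constrained:  ", q_constr[0, 0])   # bootstraps only from logged actions

The LR framework described in the abstract works with action distributions rather than a hard in-dataset filter; this sketch only illustrates why leaving the dataset's support during the Bellman backup destabilizes learning.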