Learning to act from observational data without active environmental interaction is a well-known challenge in Reinforcement Learning (RL). Recent approaches involve constraints on the learned policy or conservative updates, preventing strong deviations from the state-action distribution of the dataset. Although these methods are evaluated using non-linear function approximation, theoretical justifications are mostly limited to the tabular or linear cases. Given the impressive results of deep reinforcement learning, we argue for a need to more clearly understand the challenges in this setting. In the vein of Held & Hein's classic 1963 experiment, we propose the "tandem learning" experimental paradigm which facilitates our empirical analysis of the difficulties in offline reinforcement learning. We identify function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work. Our results provide relevant insights for offline deep reinforcement learning, while also shedding new light on phenomena observed in the online case of learning control.
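To make the "tandem learning" setup concrete, below is a minimal illustrative sketch: an active agent interacts with an environment and learns online, while a passive agent is trained on exactly the same stream of transitions without ever selecting actions. The toy chain environment, linear Q-functions, and hyperparameters are illustrative assumptions for this sketch, not the paper's actual experimental setup.

```python
# Hypothetical sketch of a tandem training loop: the active agent generates
# the data and learns from it; the passive agent learns from the identical
# data but never influences its distribution. Environment, features, and
# hyperparameters below are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 10, 2
GAMMA, LR, EPSILON = 0.99, 0.1, 0.1


def one_hot(s):
    x = np.zeros(N_STATES)
    x[s] = 1.0
    return x


class LinearQ:
    """Q(s, a) = w[a] . phi(s) with one-hot state features."""

    def __init__(self):
        self.w = rng.normal(scale=0.01, size=(N_ACTIONS, N_STATES))

    def q(self, s):
        return self.w @ one_hot(s)

    def update(self, s, a, r, s_next, done):
        # Standard Q-learning TD update on a single transition.
        target = r + (0.0 if done else GAMMA * self.q(s_next).max())
        td_error = target - self.q(s)[a]
        self.w[a] += LR * td_error * one_hot(s)


def step(s, a):
    """Toy chain environment: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done


active, passive = LinearQ(), LinearQ()

for episode in range(200):
    s, done = 0, False
    while not done:
        # Only the active agent chooses actions (epsilon-greedy).
        if rng.random() < EPSILON:
            a = int(rng.integers(N_ACTIONS))
        else:
            a = int(np.argmax(active.q(s)))
        s_next, r, done = step(s, a)

        # Both agents receive the identical transition; the passive agent
        # has no control over the data it learns from.
        active.update(s, a, r, s_next, done)
        passive.update(s, a, r, s_next, done)
        s = s_next
```

Comparing the two agents' learned value functions or greedy policies after training isolates the effect of learning passively from a fixed data stream, which is the kind of comparison the tandem paradigm is designed to enable.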