Deep reinforcement learning algorithms that learn policies by trial-and-error must learn from limited amounts of data collected by actively interacting with the environment. While many prior works have shown that proper regularization techniques are crucial for enabling data-efficient RL, a general understanding of the bottlenecks in data-efficient RL has remained elusive. Consequently, it has been difficult to devise a universal technique that works well across all domains. In this paper, we attempt to understand the primary bottleneck in sample-efficient deep RL by examining several potential hypotheses such as non-stationarity, excessive action distribution shift, and overfitting. We perform a thorough empirical analysis on state-based DeepMind Control Suite (DMC) tasks in a controlled and systematic way to show that high temporal-difference (TD) error on a validation set of transitions is the main culprit that severely degrades the performance of deep RL algorithms, and that prior methods which lead to good performance do, in fact, keep the validation TD error low. This observation gives us a robust principle for making deep RL efficient: we can hill-climb on the validation TD error using any form of regularization technique from supervised learning. We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.
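To make the quantity we hill-climb on concrete, the sketch below shows one way to estimate TD error on a held-out set of transitions and to select among candidate regularization strengths by that estimate. This is a minimal PyTorch-style sketch under assumptions of our own (the `QNetwork` architecture, the `train_fn` callback, and the dropout candidates are illustrative placeholders), not the implementation used in the paper.

```python
# Sketch (assumed, not the paper's code): measure TD error on held-out
# transitions and use it to pick among candidate regularization settings.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Small state-action value network with an optional dropout rate."""

    def __init__(self, obs_dim: int, act_dim: int, dropout: float = 0.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


@torch.no_grad()
def validation_td_error(q_net, target_q_net, policy, batch, gamma=0.99):
    """Mean squared TD error on held-out transitions (s, a, r, s', done)."""
    obs, act, rew, next_obs, done = batch
    next_act = policy(next_obs)  # a' drawn from the current policy
    target = rew + gamma * (1.0 - done) * target_q_net(next_obs, next_act)
    return ((q_net(obs, act) - target) ** 2).mean().item()


def select_by_validation_td_error(candidates, train_fn, target_q_net, policy, val_batch):
    """Return the candidate (e.g., dropout rate) with the lowest validation TD error."""
    scores = {}
    for dropout in candidates:
        q_net = train_fn(dropout)  # train a Q-network under this regularization setting
        scores[dropout] = validation_td_error(q_net, target_q_net, policy, val_batch)
    return min(scores, key=scores.get), scores
```

In an online setting, one could rerun this selection periodically as new transitions arrive, so the active regularization strength tracks whichever candidate currently achieves the lowest validation TD error.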