Deep Reinforcement Learning (Deep RL) has received increasing attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques for training neural networks (e.g., $L_2$ regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm design. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find that applying conventional regularization techniques to the policy network can often bring large improvements, especially on harder tasks. Our findings are shown to be robust against variations in training hyperparameters. We also compare these techniques with the more widely used entropy regularization. In addition, we study regularizing different components and find that regularizing only the policy network is typically the best. We further analyze why regularization may help generalization in RL from four perspectives: sample complexity, reward distribution, weight norm, and noise robustness. We hope our study provides guidance for future practices in regularizing policy optimization algorithms. Our code is available at https://github.com/xuanlinli17/iclr2021_rlreg.
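To make concrete what "regularizing the policy network" means in practice, below is a minimal PyTorch sketch (not the paper's implementation) of a diagonal-Gaussian policy for continuous control with dropout in the hidden layers and $L_2$ regularization applied via the optimizer's weight decay, contrasted with entropy regularization added as a bonus term in the loss. All network sizes, coefficients, and the dummy batch are illustrative assumptions, not values from the study.

```python
# Illustrative sketch only: conventional regularization (dropout + L2 weight decay)
# on a policy network, versus entropy regularization added to the policy loss.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Diagonal-Gaussian policy MLP with dropout on the hidden layers."""
    def __init__(self, obs_dim, act_dim, hidden=64, p_drop=0.1):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.Tanh(), nn.Dropout(p_drop),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.backbone(obs)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

# Hypothetical observation/action dimensions for a MuJoCo-style task.
policy = GaussianPolicy(obs_dim=17, act_dim=6)
# L2 regularization of the policy parameters via weight decay.
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4, weight_decay=1e-4)

# One policy-gradient step on a dummy batch; entropy regularization enters the loss.
obs = torch.randn(32, 17)
dist = policy(obs)
actions = dist.sample()
advantages = torch.randn(32)  # placeholder advantage estimates
pg_loss = -(dist.log_prob(actions).sum(-1) * advantages).mean()
entropy_bonus = dist.entropy().sum(-1).mean()
loss = pg_loss - 0.01 * entropy_bonus  # entropy coefficient is illustrative

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this sketch, $L_2$ regularization and dropout act on the policy network's parameters and activations, whereas the entropy bonus shapes the action distribution directly; the paper compares these two families of techniques across multiple policy optimization algorithms.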