Recent years have witnessed significant progress in deep Reinforcement Learning (RL). Empowered by large-scale neural networks, carefully designed architectures, novel training algorithms, and massively parallel computing devices, researchers are able to attack many challenging RL problems. However, in machine learning, more training power comes with a potential risk of more overfitting. As deep RL techniques are being applied to critical problems such as healthcare and finance, it is important to understand the generalization behaviors of the trained agents. In this paper, we conduct a systematic study of standard RL agents and find that they could overfit in various ways. Moreover, overfitting could happen "robustly": commonly used techniques in RL that add stochasticity do not necessarily prevent or detect overfitting. In particular, the same agents and learning algorithms could have drastically different test performance, even when all of them achieve optimal rewards during training. These observations call for more principled and careful evaluation protocols in RL. We conclude with a general discussion on overfitting in RL and a study of the generalization behaviors from the perspective of inductive bias.