Agents trained with deep reinforcement learning algorithms are capable of performing highly complex tasks, including locomotion in continuous environments. We investigate transferring the learning acquired in one task to a set of previously unseen tasks. Generalization and overfitting in deep reinforcement learning are not commonly addressed in current transfer learning research, and conducting a comparative analysis without an intermediate regularization step results in underperforming benchmarks and inaccurate algorithm comparisons due to rudimentary assessments. In this study, we propose regularization techniques for deep reinforcement learning in continuous control through the application of sample elimination, early stopping, and maximum entropy regularized adversarial learning. First, we discuss the importance of including the training iteration number among the hyperparameters of deep transfer reinforcement learning. Because source task performance is not indicative of the generalization capacity of an algorithm, we treat the training iteration number as a hyperparameter and introduce an additional step of reverting to earlier snapshots of the policy parameters to prevent overfitting to the source task. Then, to generate robust policies, we discard the samples that lead to overfitting via a method we call strict clipping. Furthermore, we increase the generalization capacity on widely used transfer learning benchmarks by combining maximum entropy regularization, different critic methods, and curriculum learning in an adversarial setup. Subsequently, we propose maximum entropy adversarial reinforcement learning to strengthen domain randomization. Finally, we evaluate the robustness of these methods on simulated robots in target environments where the morphology of the robot, the gravity, and the tangential friction coefficient of the environment are altered.
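As a rough illustration of the sample-elimination idea summarized above, the sketch below shows one possible reading of strict clipping on a PPO-style surrogate objective: instead of merely clipping the probability ratio, transitions whose ratio leaves the trust region are masked out of the policy loss entirely. The function name, the exact masking rule, and the choice of PPO as the base algorithm are assumptions for illustration, not the paper's definitive implementation.

```python
import torch


def strict_clip_policy_loss(log_probs_new, log_probs_old, advantages, eps=0.2):
    """Hypothetical PPO-style surrogate loss with sample elimination.

    log_probs_new / log_probs_old: log pi_theta(a|s) under the current and
        behavior policies for a batch of transitions.
    advantages: estimated advantages A(s, a) for the same transitions.
    eps: clipping range; samples with ratio outside [1 - eps, 1 + eps]
        are discarded rather than clipped (assumed reading of "strict clipping").
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Keep only samples whose importance ratio stays inside the clipping interval.
    in_range = ((ratio > 1.0 - eps) & (ratio < 1.0 + eps)).float()
    surrogate = ratio * advantages
    # Average the surrogate over the retained samples only (sample elimination).
    return -(surrogate * in_range).sum() / in_range.sum().clamp(min=1.0)
```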