Learning from raw high-dimensional data via interaction with a given environment has been effectively achieved through the use of deep neural networks. Yet the observed degradation in policy performance caused by imperceptible, worst-case, policy-dependent translations along high-sensitivity directions (i.e. adversarial perturbations) raises concerns about the robustness of deep reinforcement learning policies. In our paper, we show that these high-sensitivity directions do not lie only along particular worst-case directions, but rather are more abundant in the deep neural policy landscape and can be found via more natural means in a black-box setting. Furthermore, we show that vanilla training techniques intriguingly result in learning more robust policies compared to the policies learned via state-of-the-art adversarial training techniques. We believe our work lays out intriguing properties of the deep reinforcement learning policy manifold, and our results can help to build robust and generalizable deep reinforcement learning policies.