Policies produced by deep reinforcement learning are typically characterised by their learning curves, but they remain poorly understood in many other respects. ReLU-based policies partition the input space into piecewise linear regions. We seek to understand how the observed region counts and their densities evolve during deep reinforcement learning, using empirical results that span a range of continuous control tasks and policy network sizes. Intuitively, we might expect region density to increase during training in the areas frequently visited by the policy, thereby affording fine-grained control. We draw on recent theoretical and empirical results on the linear regions induced by neural networks in supervised learning settings to ground and compare our findings. Empirically, we find that region density increases only moderately throughout training, as measured along fixed trajectories obtained from the final policy. However, the trajectories themselves also grow in length during training, so region densities decrease when viewed from the perspective of the current trajectory. Our findings suggest that the complexity of deep reinforcement learning policies does not principally emerge from a significant growth in the complexity of functions observed on and around trajectories of the policy.
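As an illustration of the kind of measurement described above, the following minimal sketch counts linear regions crossed along a trajectory by tracking changes in the ReLU activation pattern of a small policy network. This is not the paper's exact procedure: the network sizes, random weights, and the synthetic "trajectory" are assumptions made purely for demonstration; two observations lie in the same linear region exactly when every ReLU unit has the same on/off state for both.

```python
# Illustrative sketch (assumptions: toy network sizes, random weights,
# synthetic trajectory). Estimates how many linear regions a ReLU policy
# induces along a trajectory by counting activation-pattern transitions.
import numpy as np

rng = np.random.default_rng(0)

# Toy ReLU policy body: obs_dim -> 64 -> 64 (sizes are assumptions).
obs_dim, hidden = 8, 64
W1 = rng.normal(scale=0.5, size=(hidden, obs_dim)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, hidden));  b2 = np.zeros(hidden)

def activation_pattern(obs):
    """On/off state of every ReLU unit for a single observation.
    Two observations share a linear region iff their patterns match."""
    h1 = W1 @ obs + b1
    h2 = W2 @ np.maximum(h1, 0.0) + b2
    return np.concatenate([h1 > 0, h2 > 0])

def regions_along_trajectory(states, samples_per_segment=20):
    """Count activation-pattern transitions along densely interpolated
    segments between consecutive states; density = transitions / arc length."""
    transitions, length = 0, 0.0
    prev = activation_pattern(states[0])
    for a, b in zip(states[:-1], states[1:]):
        length += np.linalg.norm(b - a)
        for t in np.linspace(0.0, 1.0, samples_per_segment)[1:]:
            pat = activation_pattern((1 - t) * a + t * b)
            if not np.array_equal(pat, prev):
                transitions += 1
            prev = pat
    return transitions, transitions / max(length, 1e-12)

# Synthetic stand-in for a policy rollout: a smooth random walk in state space.
trajectory = np.cumsum(rng.normal(scale=0.1, size=(200, obs_dim)), axis=0)
count, density = regions_along_trajectory(trajectory)
print(f"region transitions: {count}, density per unit length: {density:.2f}")
```

Comparing this density for trajectories saved at different points in training is one way to operationalise the "region density along fixed trajectories" quantity discussed in the abstract.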