In previous work, using a process we call meshing, the reachable state spaces of various continuous and hybrid systems were approximated as discrete sets of states, which can then be synthesized into a Markov chain. One application of this approach has been to analyze locomotion policies obtained by reinforcement learning, as a step towards making empirical guarantees about the stability properties of the resulting system. In a separate line of research, we introduced a modified reward function for on-policy reinforcement learning algorithms that utilizes a "fractal dimension" of rollout trajectories. This reward was shown to encourage policies that induce individual trajectories which can be more compactly represented as a discrete mesh. In this work we combine these two threads of research by building meshes of the reachable state space of a system that is subject to disturbances and controlled by policies obtained with the modified reward. Our analysis shows that the modified policies do produce substantially smaller reachable meshes. This demonstrates that agents trained with the fractal-dimension reward retain the desirable property of a more compact reachable state space even in the presence of external disturbances. The results also suggest that the previous work using mesh-based tools to analyze RL policies may be extended to higher-dimensional systems or to higher-resolution meshes than would otherwise have been possible.
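To make the two ingredients of the abstract concrete, the sketch below illustrates one plausible reading of them: rollout states are snapped onto a uniform grid ("meshing"), the occupied cells are counted as a proxy for mesh size, transitions between consecutive cells are tallied into an empirical Markov chain, and a box-counting estimate of the trajectory's fractal dimension is fit from mesh occupancy at several cell sizes. The uniform grid, the specific cell sizes, and the box-counting fit are illustrative assumptions, not the exact procedures used in the cited work.

```python
import numpy as np

def mesh_states(states, cell_size):
    """Snap continuous states onto a uniform grid mesh.

    states: (T, d) array of states visited during rollouts.
    cell_size: edge length of a mesh cell in every dimension (assumed uniform).
    Returns the list of occupied cells and the per-step cell indices.
    """
    cells = np.floor(states / cell_size).astype(np.int64)
    keys = [tuple(c) for c in cells]
    unique = sorted(set(keys))
    index = {k: i for i, k in enumerate(unique)}
    return unique, np.array([index[k] for k in keys])

def empirical_markov_chain(cell_ids, n_cells):
    """Count transitions between consecutive mesh cells and row-normalize."""
    counts = np.zeros((n_cells, n_cells))
    for a, b in zip(cell_ids[:-1], cell_ids[1:]):
        counts[a, b] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, rows, out=np.zeros_like(counts), where=rows > 0)

def box_counting_dimension(states, scales):
    """Estimate a box-counting ("fractal") dimension from mesh occupancy.

    Fits the slope of log N(eps) against log(1/eps), where N(eps) is the
    number of occupied mesh cells at cell size eps.
    """
    log_inv_eps, log_counts = [], []
    for eps in scales:
        occupied, _ = mesh_states(states, eps)
        log_inv_eps.append(np.log(1.0 / eps))
        log_counts.append(np.log(len(occupied)))
    slope, _ = np.polyfit(log_inv_eps, log_counts, 1)
    return slope

if __name__ == "__main__":
    # Toy rollout: a noisy limit cycle in 2D standing in for closed-loop dynamics.
    t = np.linspace(0, 20 * np.pi, 5000)
    traj = np.stack([np.cos(t), np.sin(2 * t)], axis=1)
    traj += 0.01 * np.random.default_rng(0).standard_normal(traj.shape)

    cells, ids = mesh_states(traj, cell_size=0.05)
    P = empirical_markov_chain(ids, len(cells))
    d = box_counting_dimension(traj, scales=[0.4, 0.2, 0.1, 0.05])
    print(f"mesh size: {len(cells)} cells, estimated dimension: {d:.2f}")
```

Under this reading, a policy rewarded for a lower fractal dimension tends to keep rollouts on a thinner set, so the same cell size yields fewer occupied cells and a smaller Markov chain to analyze.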