We propose a new approach to increasing inference performance in environments that require a specific sequence of actions to be solved. This is the case, for example, in maze environments, where ideally an optimal path is determined. Instead of learning a policy for a single step, we want to learn a policy that can predict n actions in advance. Our proposed method, called policy horizon regression (PHR), uses knowledge of the environment sampled by A2C to learn an n-dimensional policy vector in a policy distillation setup, which yields n sequential actions per observation. We test our method on the MiniGrid and Pong environments and show a drastic speedup at inference time by successfully predicting sequences of actions from a single observation.
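The sketch below illustrates the core idea in a hedged, minimal form (it is not the authors' implementation): a student network maps a single observation to n action distributions, one per future step, and is distilled with a cross-entropy loss against the next n actions taken by an A2C teacher. All sizes, hyperparameters, and the synthetic batch are illustrative assumptions.

```python
# Minimal PHR-style distillation sketch (assumed shapes and data, not the paper's code).
import torch
import torch.nn as nn

OBS_DIM, NUM_ACTIONS, HORIZON_N = 64, 4, 5   # assumed observation size, action count, horizon n

class PHRStudent(nn.Module):
    def __init__(self, obs_dim, num_actions, horizon):
        super().__init__()
        self.horizon = horizon
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * num_actions),  # n policy heads packed into one output vector
        )

    def forward(self, obs):
        # -> (batch, horizon, num_actions): one action distribution per future step
        return self.net(obs).view(-1, self.horizon, self.num_actions)

student = PHRStudent(OBS_DIM, NUM_ACTIONS, HORIZON_N)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

# Placeholder distillation batch: observations plus the next n actions the
# A2C teacher took from each observation (random stand-ins here).
obs = torch.randn(32, OBS_DIM)
teacher_actions = torch.randint(0, NUM_ACTIONS, (32, HORIZON_N))

logits = student(obs)                                            # (32, n, num_actions)
loss = ce(logits.reshape(-1, NUM_ACTIONS), teacher_actions.reshape(-1))
loss.backward()
optimizer.step()

# At inference, one forward pass yields n sequential actions per observation.
with torch.no_grad():
    plan = student(obs[:1]).argmax(dim=-1)                       # shape (1, n)
```

Because the student emits all n action logits in a single forward pass, inference cost per planned step shrinks roughly by a factor of n compared with querying a one-step policy n times, which is the speedup the abstract refers to.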