Width-based planning has demonstrated great success in recent years due to its ability to scale independently of the size of the state space. For example, Bandres et al. (2018) introduced a rollout version of the Iterated Width algorithm whose performance compares well with humans and learning methods in the pixel setting of the Atari games suite. In this setting, planning is done on-line over "screen" states, selecting actions by looking ahead into the future. However, this algorithm is purely exploratory and does not leverage past reward information. Furthermore, it requires the state to be factored into features that need to be pre-defined for the particular task, e.g., the B-PROST pixel features. In this work, we extend width-based planning by incorporating an explicit policy into the action selection mechanism. Our method, called $\pi$-IW, interleaves width-based planning and policy learning using the state-actions visited by the planner. The policy estimate takes the form of a neural network and is in turn used to guide the planning step, thus reinforcing promising paths. Surprisingly, we observe that the representation learned by the neural network can be used as a feature space for the width-based planner without degrading its performance, thus removing the requirement of pre-defined features for the planner. We compare $\pi$-IW with previous width-based methods and with AlphaZero, a method that also interleaves planning and learning, in simple environments, and show that $\pi$-IW has superior performance. We also show that the $\pi$-IW algorithm outperforms previous width-based methods in the pixel setting of the Atari games suite.
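To make the interleaving concrete, the following is a minimal, self-contained sketch of the general idea: novelty-pruned (IW(1)-style) lookahead rollouts guided by a learned softmax policy, followed by a cross-entropy update of that policy toward the actions of the best visited branch. Everything here is illustrative and assumed (the toy gridworld `ToyEnv`, the linear softmax policy, the single-rollout pruning); the actual $\pi$-IW algorithm uses a neural network policy, richer feature spaces, and a full width-based search tree.

```python
# Conceptual sketch only: policy-guided, novelty-pruned rollouts interleaved
# with policy learning. Names and environment are hypothetical, not the
# paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """4x4 gridworld; reward 1 at the bottom-right corner."""
    n_actions = 4
    def __init__(self, pos=(0, 0)):
        self.pos = pos
    def step(self, a):
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
        r, c = self.pos
        self.pos = (min(max(r + dr, 0), 3), min(max(c + dc, 0), 3))
        return self.pos, float(self.pos == (3, 3))
    def features(self):
        # Binary features (one-hot position) used by the novelty test.
        f = np.zeros(16)
        f[self.pos[0] * 4 + self.pos[1]] = 1.0
        return f

W = np.zeros((16, ToyEnv.n_actions))  # linear softmax policy weights

def policy(feat):
    logits = feat @ W
    p = np.exp(logits - logits.max())
    return p / p.sum()

def guided_rollout(env, max_depth=12):
    """One policy-guided rollout, pruned by an IW(1)-style novelty test."""
    seen = set()                       # feature atoms made true so far
    trajectory, total_reward = [], 0.0
    for _ in range(max_depth):
        feat = env.features()
        new_atoms = {i for i in np.flatnonzero(feat) if i not in seen}
        if not new_atoms and trajectory:   # prune non-novel states
            break
        seen |= new_atoms
        a = rng.choice(ToyEnv.n_actions, p=policy(feat))
        trajectory.append((feat, a))
        _, r = env.step(a)
        total_reward += r
    return trajectory, total_reward

for episode in range(200):             # interleave planning and learning
    best_traj, best_ret = [], -np.inf
    for _ in range(8):                 # several guided lookahead rollouts
        traj, ret = guided_rollout(ToyEnv())
        if ret > best_ret:
            best_traj, best_ret = traj, ret
    for feat, a in best_traj:          # cross-entropy update toward the
        p = policy(feat)               # actions of the best branch
        grad = -p
        grad[a] += 1.0
        W += 0.1 * np.outer(feat, grad)

print("learned first-step policy:", policy(ToyEnv().features()).round(2))
```

After a few hundred iterations of this loop, the policy concentrates probability on the actions that the novelty-pruned lookahead found rewarding, which in turn biases subsequent rollouts toward promising paths; this is the reinforcement effect described above, shown here only under the stated toy assumptions.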