Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies that are independent of the reward function being optimized. The recent LaMCTS method empirically learns to partition the search space in a reward-sensitive manner for black-box optimization. In this paper, we develop a novel formal regret analysis of when and why such an adaptive region partitioning scheme works. We also propose a new path planning method, LaP3, which improves the function value estimation within each sub-region and uses a latent representation of the search space. Empirically, LaP3 outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into the planning components of model-based RL methods such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 39% on average across 9 tasks, and in molecular design by up to 0.4 on properties measured on a 0-1 scale. Code is available at https://github.com/yangkevin2/neurips2021-lap3.
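To make the local-maximum issue concrete, the following is a minimal sketch of a one-dimensional cross-entropy method (CEM) on a toy bimodal reward. It is an illustration only, not the paper's implementation or benchmark: the `reward` function, the `cem` helper, and all hyperparameters are assumptions chosen for this example.

```python
import math
import random


def reward(x: float) -> float:
    """Hypothetical toy reward: local peak near x = -2 (height ~1),
    global peak near x = 3 (height ~2)."""
    return math.exp(-(x + 2.0) ** 2) + 2.0 * math.exp(-(x - 3.0) ** 2)


def cem(mean: float, std: float, iters: int = 30, pop: int = 50,
        n_elite: int = 10, seed: int = 0) -> float:
    """Basic CEM: sample from a Gaussian, keep the top-reward elites,
    and refit the Gaussian to those elites."""
    rng = random.Random(seed)
    for _ in range(iters):
        samples = [rng.gauss(mean, std) for _ in range(pop)]
        samples.sort(key=reward, reverse=True)
        elites = samples[:n_elite]
        # Refit the sampling distribution to the elite set only.
        mean = sum(elites) / n_elite
        var = sum((e - mean) ** 2 for e in elites) / n_elite
        std = math.sqrt(var) + 1e-6  # small floor keeps std positive
    return mean


# Initialized near the local peak, the search narrows around x = -2
# and never samples the region around the better peak at x = 3.
x_local = cem(mean=-2.0, std=0.5)
print(x_local, reward(x_local), reward(3.0))
```

Because each refit uses only the current elites, the sampling distribution shrinks around whichever peak it first finds. A reward-sensitive partition of the search space, as in LaMCTS and LaP3, is intended to keep other regions available for exploration.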