Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies that are independent of the reward function being optimized. Recently, LaMCTS was proposed to empirically learn reward-sensitive partitions of the search space for black-box optimization. In this paper, we develop a novel formal regret analysis of when and why such an adaptive region partitioning scheme works. We also propose a new path planning method, PlaLaM, which improves the function value estimation within each sub-region and uses a latent representation of the search space. Empirically, PlaLaM outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into model-based RL methods with planning components, such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 245% and in molecular design by up to 0.4 on properties measured on a 0-1 scale.