Solving continuous Partially Observable Markov Decision Processes (POMDPs) is challenging, particularly for high-dimensional continuous action spaces. To alleviate this difficulty, we propose a new sampling-based online POMDP solver, called Adaptive Discretization using Voronoi Trees (ADVT). It uses Monte Carlo Tree Search in combination with an adaptive discretization of the action space as well as optimistic optimization to efficiently sample high-dimensional continuous action spaces and compute the best action to perform. Specifically, we adaptively discretize the action space for each sampled belief using a hierarchical partition called Voronoi tree, which is a Binary Space Partitioning that implicitly maintains the partition of a cell as the Voronoi diagram of two points sampled from the cell. ADVT uses the estimated diameters of the cells to form an upper-confidence bound on the action value function within the cell, guiding the Monte Carlo Tree Search expansion and further discretization of the action space. This enables ADVT to better exploit local information with respect to the action value function, allowing faster identification of the most promising regions in the action space, compared to existing solvers. Voronoi trees keep the cost of partitioning and estimating the diameter of each cell low, even in high-dimensional spaces where many sampled points are required to cover the space well. ADVT additionally handles continuous observation spaces, by adopting an observation progressive widening strategy, along with a weighted particle representation of beliefs. Experimental results indicate that ADVT scales substantially better to high-dimensional continuous action spaces, compared to state-of-the-art methods.
翻译:持续解析部分可观测的 Markov 决策进程( POMDPs) 具有挑战性, 特别是对于高维持续行动空间。 为了缓解这一困难, 我们提议一个新的基于取样的在线 POMDP 解析器, 名为 Voronoi 树( ADVT ) 。 它使用 Monte Carlo 树搜索, 加上一个适应性分解动作空间, 以及乐观优化, 以高效抽样高维连续行动空间, 并计算执行的最佳动作。 具体地说, 我们利用一个名为 Voronoi 树的等级分隔, 将每个样本空间的操作空间分解, 这是一种二维空间分解, 暗中维持一个细胞的分解, 称为VOMoronoooi图, 由两个样本样本中的两个点组成。 ADVVT 使用估计的直径, 引导 Monte Carloe 搜索空间的扩展和进一步分解操作空间。 使ADVT 更好地利用行动值功能, 更快地识别行动空间中最有前景的区域,, 与现有的直径直径 直径 直径比 。