Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action spaces. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to both states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains theoretical properties similar to those of VI, without the need to wait for a full sweep through the entire action space in each update. In this paper, we show that DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1 − δ, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.
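To make the double asynchrony concrete, below is a minimal tabular sketch of the kind of update the abstract describes: a single state is selected and updated in place, and the maximization is taken over a randomly sampled subset of actions rather than the full action set. This is an illustration under stated assumptions, not the paper's exact pseudocode: the MDP representation (transition tensor P[s, a, s'] and reward matrix R[s, a]), the uniform sampling of states and actions, and the choice to keep the current value estimate inside the maximum are all assumptions made for this sketch.

```python
import numpy as np

def davi_sketch(P, R, gamma, num_updates, action_sample_size, rng=None):
    """Illustrative sketch of a doubly-asynchronous value-iteration update.

    Assumptions (not taken from the paper's pseudocode):
      P   -- transition tensor of shape (S, A, S), P[s, a, s'] = Pr(s' | s, a)
      R   -- reward matrix of shape (S, A)
      States are sampled uniformly; actions are sampled uniformly without
      replacement; the current estimate V[s] is kept inside the max.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_states, num_actions = R.shape
    V = np.zeros(num_states)

    for _ in range(num_updates):
        # Asynchrony over states: pick one state and update it in place.
        s = rng.integers(num_states)
        # Asynchrony over actions: maximize only over a sampled action subset
        # of user-defined size, instead of sweeping the full action space.
        actions = rng.choice(num_actions, size=action_sample_size, replace=False)
        q = R[s, actions] + gamma * P[s, actions] @ V
        # Keep the current estimate in the max so an unlucky action sample
        # cannot decrease the value (a choice made for this sketch).
        V[s] = max(V[s], q.max())
    return V
```

Per update, this costs O(k·S) work for a sampled subset of k actions rather than O(A·S) for a full maximization, which is the computational saving the abstract refers to when the action space is large.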