Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles the exploration. This fact makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this work, we devise a multi-step greedy RTDP algorithm, which we call $h$-RTDP, that replaces the 1-step greedy policy with an $h$-step lookahead policy. We analyze $h$-RTDP in its exact form and establish that increasing the lookahead horizon, $h$, results in an improved sample complexity, at the cost of additional computation. This is the first work to prove an improved sample complexity as a result of {\em increasing} the lookahead horizon in online planning. We then analyze the performance of $h$-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of $h$-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors.
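To make the multi-step greedy idea concrete, below is a minimal sketch (not the paper's pseudocode) of an $h$-step lookahead backup used in an RTDP-style loop on a tabular MDP with a known model. The names (\texttt{P}, \texttt{R}, \texttt{lookahead}, \texttt{h\_rtdp\_episode}), the discounted setting, and the single-step execution of the greedy action are illustrative assumptions; the paper's exact update and execution schedule may differ.

\begin{verbatim}
# Hedged sketch of h-step lookahead RTDP on a tabular MDP with known
# transition model P[s, a, s'] and rewards R[s, a]. All names and the
# discounted formulation are illustrative assumptions, not the paper's.
import numpy as np

def lookahead(s, h, P, R, V, gamma):
    """Return the h-step lookahead value and greedy action at state s,
    bootstrapping from the current value estimate V at the horizon."""
    if h == 0:
        return V[s], None
    best_q, best_a = -np.inf, None
    for a in range(P.shape[1]):
        # expected reward plus discounted (h-1)-step lookahead continuation
        q = R[s, a] + gamma * sum(
            P[s, a, s2] * lookahead(s2, h - 1, P, R, V, gamma)[0]
            for s2 in range(P.shape[0]) if P[s, a, s2] > 0
        )
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a

def h_rtdp_episode(s0, goal, h, P, R, V, gamma=0.95, max_steps=100):
    """One RTDP-style trajectory: back up the h-step lookahead value at
    each visited state, then act with the h-step greedy action."""
    s = s0
    for _ in range(max_steps):
        if s == goal:
            break
        v, a = lookahead(s, h, P, R, V, gamma)
        V[s] = v                                      # asynchronous backup
        s = np.random.choice(P.shape[0], p=P[s, a])   # follow greedy action
    return V
\end{verbatim}

The sketch highlights the trade-off stated above: a larger $h$ makes each backup more informative (improving sample complexity) while the lookahead computation grows with the horizon.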