In this paper, we propose an on-line policy iteration (PI) algorithm for finite-state infinite horizon discounted dynamic programming problems, whereby the policy improvement operation is performed on-line, only for the states that are encountered during operation of the system. This allows continuous updating and improvement of the current policy, resulting in a form of on-line PI that incorporates the improved controls into the current policy as new states and controls are generated. The algorithm converges in a finite number of stages to a type of locally optimal policy, and suggests the possibility of variants of PI and multiagent PI in which the policy improvement is simplified. Moreover, the algorithm can be used with on-line replanning, and is also well-suited for on-line PI algorithms with value and policy approximations.
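To make the idea concrete, the following is a minimal sketch of the on-line PI scheme described above for a small randomly generated finite-state discounted MDP. It is an illustrative approximation under our own assumptions (random transition probabilities `P`, stage costs `g`, discount factor `alpha`, and an optimistic one-step evaluation update), not the paper's exact algorithm; all names are hypothetical.

```python
import numpy as np

# Sketch of on-line policy iteration: policy improvement is done
# only at the states actually encountered along a trajectory.
# Hypothetical problem data; NOT taken from the paper.
rng = np.random.default_rng(0)
n_states, n_controls, alpha = 5, 3, 0.9

# Transition probabilities P[u, x, y] and stage costs g[x, u].
P = rng.random((n_controls, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n_states, n_controls))

J = np.zeros(n_states)              # current cost estimates
mu = np.zeros(n_states, dtype=int)  # current policy

x = 0  # start state of the on-line trajectory
for t in range(2000):
    # Policy improvement only at the currently visited state x:
    # compute Q-factors for all controls at x under the current J.
    q = g[x] + alpha * P[:, x, :] @ J
    mu[x] = int(np.argmin(q))
    # Optimistic (one-step) evaluation update at x.
    J[x] = q[mu[x]]
    # Move to the next state under the improved control.
    x = rng.choice(n_states, p=P[mu[x], x])

print("policy:", mu)
print("costs:", np.round(J, 3))
```

Because stage costs here are nonnegative and the problem is discounted, the cost estimates stay bounded; states never visited by the trajectory simply retain their initial policy and cost estimate, which is the distinguishing feature of the on-line scheme.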