We investigate online Markov Decision Processes (MDPs) with adversarially changing loss functions and known transitions. We adopt dynamic regret as the performance measure, defined as the performance difference between the learner and any sequence of feasible changing policies. This measure is strictly stronger than the standard static regret, which benchmarks the learner's performance against a fixed comparator policy. We consider three foundational models of online MDPs: episodic loop-free Stochastic Shortest Path (SSP), episodic SSP, and infinite-horizon MDPs. For each of these three models, we propose novel online ensemble algorithms and establish their dynamic regret guarantees, where the results for episodic (loop-free) SSP are provably minimax optimal in terms of the time horizon and a certain non-stationarity measure. Furthermore, when the online environments encountered by the learner are predictable, we design improved algorithms that achieve better dynamic regret bounds for episodic (loop-free) SSP; moreover, we establish impossibility results for infinite-horizon MDPs.
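To make the comparator explicit, here is a minimal sketch of dynamic regret in the episodic setting, written in occupancy-measure notation; the symbols $q^{\pi}$ (occupancy measure of policy $\pi$ under the known transitions), $\ell_k$ (adversarial loss in episode $k$), $K$ (number of episodes), and the learner's policies $\widehat{\pi}_k$ are notational assumptions for illustration, not taken verbatim from the abstract:

% Dynamic regret against an arbitrary feasible comparator sequence (sketch).
\[
  \mathrm{D\text{-}Regret}(\pi_1,\dots,\pi_K)
  \;=\; \sum_{k=1}^{K} \big\langle q^{\widehat{\pi}_k},\, \ell_k \big\rangle
  \;-\; \sum_{k=1}^{K} \big\langle q^{\pi_k},\, \ell_k \big\rangle .
\]

Choosing a fixed comparator $\pi_1=\cdots=\pi_K$ recovers the standard static regret; the non-stationarity measure appearing in the bounds is typically taken to quantify how much the comparator sequence $\pi_1,\dots,\pi_K$ varies over time.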