We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate $\widetilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
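To make the skewing-and-bonus idea concrete, the following is a minimal, illustrative Python sketch of an EB-SSP-style planning step, not the paper's exact algorithm: the empirical transition estimate is slightly mixed toward the goal state (so the resulting SSP model is proper and value iteration converges), and the empirical costs are reduced by an exploration bonus to enforce optimism. The variable names (`counts`, `cost_sums`) and the specific bonus shape are hypothetical simplifications; the actual EB-SSP bonus is more refined (e.g., it involves variance terms and an estimate of $B_{\star}$).

```python
import numpy as np

def optimistic_ssp_value_iteration(counts, cost_sums, goal, n_iters=1000, tol=1e-6, delta=0.1):
    """Illustrative sketch of an EB-SSP-style planning step (simplified).

    counts[s, a, s'] : visit counts of observed transitions (hypothetical input)
    cost_sums[s, a]  : accumulated observed costs for (s, a) (hypothetical input)
    goal             : index of the zero-cost absorbing goal state

    Returns an optimistic value vector V with V[goal] = 0.
    """
    S, A, _ = counts.shape
    n_sa = np.maximum(counts.sum(axis=2), 1)                  # visit counts n(s, a), at least 1

    # Empirical transitions, slightly skewed toward the goal so that every
    # policy reaches the goal in the model and value iteration converges.
    p_hat = counts / n_sa[:, :, None]
    p_tilde = (n_sa[:, :, None] * p_hat) / (n_sa[:, :, None] + 1)
    p_tilde[:, :, goal] += 1.0 / (n_sa + 1)

    # Empirical costs perturbed downward by an exploration bonus (optimism),
    # clipped at zero so that costs stay nonnegative.
    c_hat = cost_sums / n_sa
    bonus = np.sqrt(np.log(S * A * n_sa / delta) / n_sa)      # hypothetical bonus shape
    c_tilde = np.maximum(c_hat - bonus, 0.0)

    # Value iteration on the optimistic model.
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = c_tilde + (p_tilde * V[None, None, :]).sum(axis=2)
        V_new = Q.min(axis=1)
        V_new[goal] = 0.0                                      # goal is absorbing and cost-free
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

The design choice worth noting is that the $1/(n(s,a)+1)$ mass added toward the goal vanishes as counts grow, so the perturbed model converges to the empirical one while keeping the intermediate value-iteration problems well posed.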