We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate $\tilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.
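To make the construction concrete, here is a minimal sketch of the optimistic model the abstract describes, assuming $g$ denotes the goal state, $n(s,a)$ the visit count of the pair $(s,a)$, and $\hat{P}, \hat{c}$ the empirical transitions and costs; the skew below is one natural instantiation, and the bonus $b(s,a)$ is a generic placeholder rather than the paper's exact expression:
$$\tilde{P}(s' \mid s,a) \;=\; \frac{n(s,a)}{n(s,a)+1}\,\hat{P}(s' \mid s,a) \;+\; \frac{\mathbb{1}\{s' = g\}}{n(s,a)+1}, \qquad \tilde{c}(s,a) \;=\; \max\big\{\hat{c}(s,a) - b(s,a),\, 0\big\}.$$
The extra transition mass toward $g$ makes every stationary policy proper in the skewed model, so the value iteration $V_{i+1}(s) \leftarrow \min_{a} \big\{ \tilde{c}(s,a) + \sum_{s'} \tilde{P}(s' \mid s,a)\, V_i(s') \big\}$ is guaranteed to converge, while the downward cost perturbation keeps the resulting value estimates optimistic.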