We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is $\widetilde O(\sqrt{ (B_\star^2 + B_\star) |S| |A| K})$ where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the $\Omega (\sqrt{ B_\star^2 |S| |A| K})$ lower bound of Rosenberg et al. [2020] for $B_\star \ge 1$, and improves their regret bound by a factor of $\sqrt{|S|}$. For $B_\star < 1$ we prove a matching lower bound of $\Omega (\sqrt{ B_\star |S| |A| K})$. Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.
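For concreteness, the regret being minimized can be written in the form standard in the SSP literature; the notation below (e.g. $I^k$, $c_i^k$, and $s_{\mathrm{init}}$) is illustrative and not fixed by the abstract itself:
$$
R_K \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I^k} c_i^k \;-\; K \cdot V^\star(s_{\mathrm{init}}),
$$
where $I^k$ is the number of steps taken to reach the goal in episode $k$, $c_i^k$ is the cost incurred at step $i$ of episode $k$, and $V^\star(s_{\mathrm{init}})$ is the minimal expected total cost of reaching the goal from the initial state (so $V^\star(s) \le B_\star$ for every state $s$).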