We study the Stochastic Shortest Path (SSP) problem, in which an agent has to reach a goal state with minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for $K$ episodes and has to learn to approximate the optimal policy as closely as possible. In this work we show that the minimax regret for this setting is $\widetilde O(B_\star \sqrt{|S| |A| K})$, where $B_\star$ is a bound on the expected cost of the optimal policy from any state, $S$ is the state space, and $A$ is the action space. This matches the lower bound of Rosenberg et al. (2020) up to logarithmic factors and improves their regret bound by a factor of $\sqrt{|S|}$. Our algorithm runs in polynomial time per episode and is based on a novel reduction to reinforcement learning in finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends only logarithmically on the horizon, yielding the same regret guarantees for SSP.
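For concreteness, a minimal sketch of the regret quantity this bound refers to, under the standard SSP learning formulation (the symbols $I_k$, $c_k^i$, $s_{\mathrm{init}}$, and $V^\star$ are illustrative notation, not taken verbatim from this work): the learner's regret over $K$ episodes is the total cost it accumulates minus $K$ times the optimal expected cost from the initial state,
$$
R_K \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I_k} c_k^i \;-\; K \cdot V^\star(s_{\mathrm{init}}),
\qquad
R_K \;=\; \widetilde O\!\left(B_\star \sqrt{|S|\,|A|\,K}\right),
$$
where $I_k$ is the number of steps the agent takes in episode $k$ before reaching the goal and $c_k^i$ is the cost incurred at its $i$-th step.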