We make significant progress toward solving the stochastic shortest path problem with adversarial costs and unknown transition. Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^2ADT_\star K})$ regret for the full-information setting and $\widetilde{O}(\sqrt{S^3A^2DT_\star K})$ regret for the bandit feedback setting, where $D$ is the diameter, $T_\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our work strictly improves on the result of Rosenberg and Mansour (2020) in the full-information setting, extends the work of Chen et al. (2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition. To remedy the gap between our upper bounds and the current best lower bounds, which are constructed via a stochastically oblivious adversary, we also propose algorithms with near-optimal regret for this special case.
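As a quick reading of the two bounds above (a side calculation, not a claim made in the abstract, and ignoring the logarithmic factors hidden by $\widetilde{O}$), the bandit-feedback bound exceeds the full-information bound by a factor of $\sqrt{SA}$:

$$\frac{\sqrt{S^3A^2DT_\star K}}{\sqrt{S^2ADT_\star K}} = \sqrt{\frac{S^3A^2}{S^2A}} = \sqrt{SA}.$$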