Stochastic shortest path (SSP) is a well-known problem in planning and control in which an agent must reach a goal state while minimizing its total expected cost. In this paper we present the adversarial SSP model, which also accounts for adversarial changes in the costs over time while the underlying transition function remains unchanged. Formally, an agent interacts with an SSP environment for $K$ episodes; the cost function changes arbitrarily between episodes, and the transitions are unknown to the agent. We develop the first algorithms for adversarial SSPs and prove high-probability regret bounds of $\widetilde O (\sqrt{K})$ assuming all costs are strictly positive, and $\widetilde O (K^{3/4})$ in the general case. We are the first to consider this natural setting of adversarial SSP and obtain sub-linear regret for it.
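For concreteness, the regret in this setting can be written along the following lines. The notation is an assumption introduced here for illustration and does not appear in the abstract: $c_k$ is the cost function of episode $k$, $(s_i^k, a_i^k)$ is the $i$-th state-action pair visited in episode $k$, $I^k$ is the number of steps taken in episode $k$ until the goal is reached, $s_{\text{init}}$ is the initial state, and $J_k^{\pi}(s)$ is the expected total cost of policy $\pi$ starting from $s$ under $c_k$. The comparator is assumed to range over the set $\Pi_{\text{proper}}$ of stationary policies that reach the goal with probability 1.

$$
R_K \;=\; \sum_{k=1}^{K} \sum_{i=1}^{I^k} c_k\bigl(s_i^k, a_i^k\bigr) \;-\; \min_{\pi \in \Pi_{\text{proper}}} \sum_{k=1}^{K} J_k^{\pi}\bigl(s_{\text{init}}\bigr)
$$

Under this reading, the stated bounds say that $R_K$ grows as $\widetilde O (\sqrt{K})$ when all costs are strictly positive and as $\widetilde O (K^{3/4})$ in general, so in both cases the average per-episode regret $R_K / K$ vanishes as $K$ grows.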