We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees for this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP), where all rewards are non-positive. For SSP, we present an adaptation procedure for the case when the scale of rewards $B_\star$ is unknown. We show that there is no price for adaptation: our regret bound matches the bound attainable when $B_\star$ is known. We also provide a scale adaptation procedure for the special case of stochastic longest path (SLP), where all rewards are non-negative. However, unlike in SSP, we show through a lower bound that here the price for adaptation is unavoidable.