We propose two algorithms that use linear function approximation (LFA) for stochastic shortest path (SSP) and bound their regret over $K$ episodes. When all stationary policies are proper, our first algorithm obtains sublinear regret ($K^{3/4}$), is computationally efficient, and uses stationary policies. This is the first LFA algorithm with these three properties, to the best of our knowledge. Our second algorithm improves the regret to $\sqrt{K}$ when the feature vectors satisfy certain assumptions. Both algorithms are special cases of a more general one, which has $\sqrt{K}$ regret for general features given access to a certain computation oracle. These algorithms and regret bounds are the first for SSP with function approximation.