We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models. We call this class of SSP problems linear mixture SSPs. We propose a novel algorithm with Hoeffding-type confidence sets for learning linear mixture SSPs, which attains an $\tilde{\mathcal{O}}(d B_{\star}^{1.5}\sqrt{K/c_{\min}})$ regret. Here $K$ is the number of episodes, $d$ is the dimension of the feature mapping in the mixture model, $B_{\star}$ bounds the expected cumulative cost of the optimal policy, and $c_{\min}>0$ is a lower bound on the cost function. Our algorithm also applies to the case $c_{\min} = 0$, where it guarantees an $\tilde{\mathcal{O}}(K^{2/3})$ regret. To the best of our knowledge, this is the first algorithm with a sublinear regret guarantee for learning linear mixture SSPs. Moreover, we design a refined Bernstein-type confidence set and propose an improved algorithm, which provably achieves an $\tilde{\mathcal{O}}(d B_{\star}\sqrt{K/c_{\min}})$ regret. Complementing the regret upper bounds, we also prove a lower bound of $\Omega(dB_{\star} \sqrt{K})$. Hence, our improved algorithm matches the lower bound up to a $1/\sqrt{c_{\min}}$ factor and poly-logarithmic factors, achieving a near-optimal regret guarantee.
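As a quick check on how the stated bounds relate, the following ratios compare them directly. This is only a sketch that suppresses constants and poly-logarithmic factors, and it uses nothing beyond the rates quoted above. The first ratio is the gain of the Bernstein-type algorithm over the Hoeffding-type one; the second is the remaining gap between the improved upper bound and the lower bound.

$$
\frac{d B_{\star}^{1.5}\sqrt{K/c_{\min}}}{d B_{\star}\sqrt{K/c_{\min}}} = \sqrt{B_{\star}},
\qquad
\frac{d B_{\star}\sqrt{K/c_{\min}}}{d B_{\star}\sqrt{K}} = \frac{1}{\sqrt{c_{\min}}}.
$$

So the refined confidence set saves a $\sqrt{B_{\star}}$ factor, and the improved algorithm is within a $1/\sqrt{c_{\min}}$ factor of the lower bound.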