This paper studies Stochastic Shortest Path (SSP) problems in known and unknown environments from the perspective of convex optimisation. It first recalls results in the known parameter case, and develops understanding through different proofs. It then focuses on the unknown parameter case, where it studies extended value iteration (EVI) operators. This includes the existing operators used in Rosenberg et al. [26] and Tarbouriech et al. [31] based on the l-1 norm and supremum norm, as well as defining EVI operators corresponding to other norms and divergences, such as the KL-divergence. This paper shows in general how the EVI operators relate to convex programs, and the form of their dual, where strong duality is exhibited. This paper then focuses on whether the bounds from finite horizon research of Neu and Pike-Burke [21] can be applied to these extended value iteration operators in the SSP setting. It shows that similar bounds to [21] for these operators exist, however they lead to operators that are not in general monotone and have more complex convergence properties. In a special case we observe oscillating behaviour. This paper generates open questions on how research may progress, with several examples that require further examination.
翻译:本文研究已知和未知环境中的已知和未知最短路径(SSP)问题。 它首先回顾已知参数案例的结果, 并通过不同的证明发展理解。 它然后关注未知参数案例, 研究扩展值迭代操作员, 包括罗森伯格等人[ 和Tarbouriech等人[31] 使用的现有操作员, 依据l-1 规范和 supremum 规范, 以及定义与KL- diverence 等其他规范和差异相对应的 EVI 操作员。 本文概括地展示了 EVI 操作员与 convex 程序的关系, 以及其双重形式, 展示了很强的双重性。 本文然后重点介绍的是, 在SSP 设置中, 对 Neu 和 Pike- Burke [21] 的有限视野研究中所使用的现有操作员的界限是否适用于这些扩展值迭代操作员。 它表明, 这些操作员与[ 21] 相似的界限存在, 但是它们导致操作员在一般的单质中并不具有, 并且具有更为复杂的组合性, 。 在特定的研究中, 我们观察了这些实验中如何产生进一步的论文的趋同。