Goal-oriented Reinforcement Learning, where the agent needs to reach a goal state while simultaneously minimizing the cost, has received significant attention in real-world applications. Its theoretical formulation, the stochastic shortest path (SSP) problem, has been intensively studied in the online setting. Nevertheless, it remains understudied when such online interaction is prohibited and only historical data is available. In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite. We design simple value-iteration-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks. Notably, our analysis of these simple algorithms yields strong instance-dependent bounds, which in turn imply near-minimax-optimal worst-case bounds. We hope our study helps illuminate the fundamental statistical limits of the offline SSP problem and motivates further studies beyond the scope of the current consideration.
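As a rough illustration of the value-iteration-based evaluation referred to above, the following is a minimal sketch of OPE for a tabular SSP: a fixed policy is evaluated by running Bellman backups on a plug-in empirical model until a fixed point is reached. The function name, the inputs `P_hat` and `c_hat` (empirical transition probabilities and costs estimated from the offline data), and the stopping tolerance are our own assumptions for illustration; this is not the paper's exact algorithm, which additionally comes with instance-dependent guarantees.

```python
import numpy as np

def ssp_policy_evaluation(P_hat, c_hat, pi, goal, tol=1e-8, max_iter=10_000):
    """Evaluate a fixed policy in a tabular SSP via value iteration (sketch).

    P_hat : (S, A, S) array of estimated transition probabilities.
    c_hat : (S, A) array of estimated per-step costs.
    pi    : length-S integer array, pi[s] is the action taken in state s.
    goal  : index of the absorbing, zero-cost goal state.

    Returns the estimated cost-to-go V under pi, with V[goal] = 0.
    """
    S = P_hat.shape[0]
    states = np.arange(S)
    P_pi = P_hat[states, pi]          # (S, S) transition matrix under pi
    c_pi = c_hat[states, pi]          # (S,) per-step cost under pi
    V = np.zeros(S)
    for _ in range(max_iter):
        # Bellman backup: V(s) = c(s, pi(s)) + sum_{s'} P(s' | s, pi(s)) V(s')
        V_new = c_pi + P_pi @ V
        V_new[goal] = 0.0             # the goal is absorbing and cost-free
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

For a proper policy (one that reaches the goal with probability one from every state), this iteration converges to the cost-to-go of the empirical model; the offline policy learning counterpart would replace the fixed-policy backup with a minimization over actions.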