We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose an information-theoretic lower bound on the average number of steps required before a correct answer can be given with probability at least $1-\delta$. This lower bound involves a non-convex optimization problem, for which we propose a convex relaxation. We further provide an algorithm whose sample complexity matches the relaxed lower bound up to a factor of $2$. This algorithm addresses general communicating MDPs; under an additional ergodicity assumption, we propose a variant with a reduced exploration rate (and hence faster convergence). This work extends previous results on the \emph{generative setting}~\cite{marjani2020adaptive}, where at each step the agent may observe the random outcome of any (state, action) pair. In contrast, we show here how to cope with the \emph{navigation constraints} induced by the trajectory. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains, which we believe to be of wide interest in the analysis of Markov Decision Processes.
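For orientation, lower bounds of this kind typically take the following change-of-measure form; this is a sketch based on the standard best-policy identification literature, and the symbols $\tau_\delta$, $T^*(\mathcal{M})$, $\Omega(\mathcal{M})$, and $\mathrm{Alt}(\mathcal{M})$ are assumed notation rather than definitions taken from this abstract:
\[
\mathbb{E}_{\mathcal{M}}[\tau_\delta] \;\ge\; T^*(\mathcal{M})\,\mathrm{kl}(\delta,\,1-\delta),
\qquad
T^*(\mathcal{M})^{-1} \;=\; \sup_{\omega \in \Omega(\mathcal{M})} \;\inf_{\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})} \;\sum_{s,a} \omega(s,a)\,\mathrm{KL}\big(p_{\mathcal{M}}(\cdot \mid s,a)\,\big\|\,p_{\mathcal{M}'}(\cdot \mid s,a)\big),
\]
where $\tau_\delta$ is the stopping time of a $\delta$-correct algorithm, $\mathrm{Alt}(\mathcal{M})$ is the set of alternative MDPs with a different optimal policy, and $\Omega(\mathcal{M})$ is the set of state-action frequencies achievable under the navigation constraints. The sup-inf over this constrained set is the source of the non-convexity mentioned above.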