We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model. We first derive an instance-specific lower bound on the expected number of samples required to identify an $\varepsilon$-optimal policy with probability $1-\delta$. The lower bound characterizes the optimal sampling rule as the solution of an intricate non-convex optimization program, but can be used as the starting point to devise simple and near-optimal sampling rules and algorithms. We devise such algorithms. One of these exhibits a sample complexity upper bounded by ${\cal O}({\frac{d}{(\varepsilon+\Delta)^2}} (\log(\frac{1}{\delta})+d))$ where $\Delta$ denotes the minimum reward gap of sub-optimal actions and $d$ is the dimension of the feature space. This upper bound holds in the moderate-confidence regime (i.e., for all $\delta$), and matches existing minimax and gap-dependent lower bounds. We extend our algorithm to episodic linear MDPs.