We investigate the problems of model estimation and reward-free learning in episodic Block MDPs. In these MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states. We are first interested in estimating the latent state decoding function (the mapping from observations to latent states) based on data generated under a fixed behavior policy. We derive an information-theoretic lower bound on the error rate for estimating this function and present an algorithm approaching this fundamental limit. In turn, our algorithm also provides estimates of all the components of the MDP. We then study the problem of learning near-optimal policies in the reward-free framework. Building on our efficient model estimation algorithm, we show that we can infer a policy converging (as the number of collected samples grows large) to the optimal policy at the best possible rate. Interestingly, our analysis provides necessary and sufficient conditions under which exploiting the block structure yields improvements in the sample complexity for identifying near-optimal policies. When these conditions are met, the sample complexity in the minimax reward-free setting is improved by a multiplicative factor $n$, where $n$ is the number of possible contexts.