We propose the Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm, which interleaves exploration and exploitation epochs, for model-based RL problems that aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. During exploration, DSEE explores the environment and updates its estimates of the expected rewards and transition probabilities. During exploitation, the latest estimates of the system dynamics are used to obtain a policy that is robust with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time. We also discuss a method for efficient exploration that uses a multi-hop MDP and the Metropolis-Hastings algorithm to sample each state-action pair uniformly with high probability.
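To make the epoch structure concrete, the following is a minimal Python sketch of interleaved exploration and exploitation epochs. The `env.step(s, a)` interface, the uniform-random exploration rule, the certainty-equivalent planning via `value_iteration`, and the specific epoch-length schedule (fixed-length exploration, geometrically growing exploitation) are illustrative assumptions, not the exact design or schedule analyzed in the paper.

```python
import numpy as np

def value_iteration(r_hat, p_hat, gamma=0.95, iters=200):
    """Greedy policy for the estimated MDP (certainty-equivalent planning)."""
    n_states, n_actions = r_hat.shape
    v = np.zeros(n_states)
    for _ in range(iters):
        q = r_hat + gamma * p_hat @ v      # Q[s, a] under the current value estimate
        v = q.max(axis=1)
    return q.argmax(axis=1)

def dsee(env, n_states, n_actions, horizon, explore_len=50, exploit_len0=100, growth=2.0):
    """Sketch of DSEE-style interleaving of exploration and exploitation epochs."""
    counts = np.zeros((n_states, n_actions))
    reward_sum = np.zeros((n_states, n_actions))
    trans_counts = np.zeros((n_states, n_actions, n_states))
    t, exploit_len, s = 0, exploit_len0, 0

    while t < horizon:
        # Exploration epoch: take (here, uniformly random) actions and update the
        # empirical estimates of expected rewards and transition probabilities.
        for _ in range(explore_len):
            a = np.random.randint(n_actions)
            s_next, r = env.step(s, a)     # assumed environment interface
            counts[s, a] += 1
            reward_sum[s, a] += r
            trans_counts[s, a, s_next] += 1
            s, t = s_next, t + 1

        # Exploitation epoch: plan with the latest model estimates and follow the
        # resulting policy. Exploitation epochs grow geometrically so the fraction
        # of time spent exploring (and hence the regret) stays sub-linear in time.
        r_hat = reward_sum / np.maximum(counts, 1)
        p_hat = trans_counts / np.maximum(counts[..., None], 1)
        policy = value_iteration(r_hat, p_hat)
        for _ in range(int(exploit_len)):
            a = policy[s]
            s, _ = env.step(s, a)
            t += 1
        exploit_len *= growth
```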