We propose the Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm, which interleaves exploration and exploitation epochs for model-based RL problems that aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates of the expected rewards and transition probabilities. During exploitation, the latest estimates of the expected rewards and transition probabilities are used to obtain a policy that is robust with high probability. We design the lengths of the exploration and exploitation epochs such that the cumulative regret grows as a sub-linear function of time.
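To illustrate the deterministic interleaving described above, the following is a minimal sketch of an epoch schedule in Python. The function names and the specific epoch-length choices (exploration length growing linearly in the epoch index, exploitation length growing geometrically) are illustrative assumptions, not the paper's exact design; they only show how a fixed, data-independent schedule can make the fraction of time spent exploring decay, which is the mechanism behind the sub-linear regret claim.

```python
def dsee_schedule(num_epochs, explore_len_fn, exploit_len_fn):
    """Yield (phase, epoch_index, length) for a deterministic interleaving
    of exploration and exploitation epochs (hypothetical helper)."""
    for k in range(1, num_epochs + 1):
        yield ("explore", k, explore_len_fn(k))   # update model estimates here
        yield ("exploit", k, exploit_len_fn(k))   # run policy from latest estimates

if __name__ == "__main__":
    # Illustrative epoch lengths (assumptions, not the paper's schedule):
    # exploration epochs grow slowly, exploitation epochs grow geometrically.
    explore_len = lambda k: 10 * k
    exploit_len = lambda k: 2 ** k

    total_steps, explore_steps = 0, 0
    for phase, k, length in dsee_schedule(10, explore_len, exploit_len):
        total_steps += length
        if phase == "explore":
            explore_steps += length
    print(f"fraction of steps spent exploring: {explore_steps / total_steps:.3f}")
```

Running the sketch shows the exploration fraction shrinking as the horizon grows, which is the qualitative behavior needed for the cumulative regret to remain sub-linear in time.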