We study reinforcement learning for continuous-time Markov decision processes (MDPs) in the finite-horizon episodic setting. We present a learning algorithm based on value iteration and upper confidence bounds. We derive an upper bound on the worst-case expected regret of the proposed algorithm and establish a worst-case lower bound; both bounds are of the order of the square root of the number of episodes. Finally, we conduct simulation experiments to illustrate the performance of our algorithm.
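For intuition about the general approach, the following is a minimal, discrete-time sketch of optimistic value iteration with upper-confidence exploration bonuses for a tabular finite-horizon MDP (in the spirit of UCBVI-style algorithms). It is an illustrative assumption, not the paper's continuous-time algorithm; the function name `ucb_value_iteration`, the bonus constant `c`, and the specific bonus form are all hypothetical.

```python
import numpy as np

def ucb_value_iteration(counts, rewards, H, c=1.0, delta=0.05):
    """Optimistic backward value iteration with UCB bonuses (sketch).

    counts[s, a, s'] -- visit counts for transition (s, a) -> s'
    rewards[s, a]    -- empirical mean reward for (s, a), in [0, 1]
    H                -- horizon (steps per episode)
    Returns optimistic Q-values of shape (H, S, A).
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                         # visits to each (s, a)
    p_hat = counts / np.maximum(n[..., None], 1)   # empirical transition kernel
    # Hoeffding-style exploration bonus; the constant c is a tuning assumption
    bonus = c * H * np.sqrt(np.log(S * A * H / delta) / np.maximum(n, 1))

    Q = np.zeros((H, S, A))
    V = np.zeros(S)                                # terminal value V_H = 0
    for h in reversed(range(H)):
        # optimistic Bellman backup, clipped at the max attainable return
        Q[h] = np.minimum(rewards + bonus + p_hat @ V, H - h)
        V = Q[h].max(axis=1)                       # greedy value at step h
    return Q
```

Acting greedily with respect to these optimistic Q-values each episode and updating the counts is what yields the square-root-in-episodes regret typical of such UCB-based methods.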