We consider reinforcement learning for continuous-time Markov decision processes (MDPs) in the infinite-horizon, average-reward setting. In contrast to discrete-time MDPs, a continuous-time process transitions to a new state after an action is taken and stays there for a random holding time. With the transition probabilities and the rates of the exponential holding times unknown, we derive instance-dependent regret lower bounds that are logarithmic in the time horizon. Moreover, we design a learning algorithm and establish a finite-time regret bound that achieves this logarithmic growth rate. Our analysis builds upon upper confidence reinforcement learning, a delicate estimation of the mean holding times, and stochastic comparison of point processes.
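To make the setting concrete, the following is a minimal Python sketch of the dynamics described above, assuming a small tabular continuous-time MDP with exponentially distributed holding times and a reward accrued per unit of time; all names (`P`, `rates`, `rewards`, `step`) and the reward form are illustrative assumptions, not the paper's notation or algorithm.

```python
import numpy as np

# Illustrative sketch (assumed setup): after taking action a in state s, the
# process jumps to a next state according to unknown transition probabilities
# and waits an exponentially distributed holding time with unknown rate.
rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel P[s, a, s']
rates = rng.uniform(0.5, 2.0, size=(n_states, n_actions))         # exponential holding-time rates
rewards = rng.uniform(0.0, 1.0, size=(n_states, n_actions))       # reward per unit time (assumed form)

def step(s, a):
    """Simulate one transition: sample the holding time, then the next state."""
    holding_time = rng.exponential(1.0 / rates[s, a])
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, holding_time, rewards[s, a] * holding_time

# A UCRL-style learner (as referenced in the abstract) could maintain visit
# counts N[s, a] and summed holding times T[s, a], using T / N as an estimate
# of the mean holding time 1 / rates[s, a] together with a confidence radius
# that shrinks as N grows; the details here are purely illustrative.
```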