In this paper we introduce a new approach to discrete-time semi-Markov decision processes based on the sojourn time process. Different characterizations of discrete-time semi-Markov processes are exploited, and decision processes are constructed by means of these characterizations. With this new approach, the agent is allowed to consider different actions depending on how long the process has been in its current state. Numerical methods based on $Q$-learning algorithms for finite-horizon reinforcement learning and on stochastic recursive relations are investigated. We consider a toy example in which the reward depends on the sojourn time, in accordance with the \textit{gambler's fallacy}, and we prove that the underlying process does not, in general, exhibit the Markov property. Finally, we use this example to carry out numerical evaluations of the previously presented $Q$-learning algorithms and of an alternative method based on deep reinforcement learning.
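To make the role of the sojourn time concrete, the following display sketches how a tabular $Q$-learning update could act on a state augmented with the sojourn time; the notation ($x$ for the state, $v$ for the sojourn time, $\alpha_n$ for the learning rate, $\gamma$ for the discount factor) is an illustrative choice of ours and not necessarily the construction used in the paper (in the finite-horizon case, $Q$ would additionally carry a time index):
\[
Q_{n+1}(x,v,a) = (1-\alpha_n)\,Q_n(x,v,a) + \alpha_n\Big( r(x,v,a) + \gamma \max_{a' \in A} Q_n(x',v',a') \Big),
\qquad
v' = \begin{cases} v+1 & \text{if } x' = x,\\[2pt] 1 & \text{if } x' \neq x, \end{cases}
\]
where $(x',v')$ denotes the observed successor of the augmented state $(x,v)$ under action $a$. Augmenting the state in this way is what lets the reward $r(x,v,a)$, and hence the learned policy, depend on how long the process has remained in the current state.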