In this paper we introduce a new approach to discrete-time semi-Markov decision processes based on the sojourn time process. Different characterizations of discrete-time semi-Markov processes are exploited, and decision processes are constructed by means of them. With this new approach, the agent is allowed to choose different actions depending also on the sojourn time of the process in its current state. A numerical method for finite-horizon reinforcement learning, based on $Q$-learning algorithms and stochastic recursive relations, is investigated. Finally, we consider two toy examples: one in which the reward depends on the sojourn time, according to the gambler's fallacy; the other in which the environment is semi-Markov even though the reward function does not depend on the sojourn time. These are used to carry out some numerical evaluations of the previously presented $Q$-learning algorithm and of a different, naive method based on deep reinforcement learning.
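To illustrate what "actions depending also on the sojourn time" means in practice, the following is a minimal sketch, not the paper's finite-horizon algorithm: a standard tabular $Q$-learning loop in which the action-value table is indexed by the pair (state, sojourn time in the current state), so the learned policy may act differently the longer the process has remained in a state. The toy dynamics, reward, and all numerical parameters below are illustrative assumptions.

\begin{verbatim}
import numpy as np

# Sketch: Q-learning on a sojourn-time-augmented state space.
# All dynamics and parameters here are illustrative assumptions,
# not the environments or the algorithm studied in the paper.

rng = np.random.default_rng(0)

n_states, n_actions, max_sojourn = 3, 2, 8
horizon, episodes = 30, 2000
alpha, gamma, eps = 0.1, 0.95, 0.1

# Q is indexed by (state, sojourn time in the current state, action).
Q = np.zeros((n_states, max_sojourn, n_actions))

def step(s, u, a):
    """Toy semi-Markov step: the chance of leaving state s grows with u."""
    leave_prob = min(1.0, 0.2 + 0.1 * u + 0.2 * a)
    if rng.random() < leave_prob:
        s_next, u_next = (s + 1) % n_states, 0           # jump: sojourn resets
    else:
        s_next, u_next = s, min(u + 1, max_sojourn - 1)  # stay: sojourn grows
    reward = 1.0 if u_next == 0 else -0.1                # reward leaving quickly
    return s_next, u_next, reward

for _ in range(episodes):
    s, u = 0, 0
    for _ in range(horizon):
        # epsilon-greedy over the augmented pair (state, sojourn time)
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s, u]))
        s2, u2, r = step(s, u, a)
        Q[s, u, a] += alpha * (r + gamma * np.max(Q[s2, u2]) - Q[s, u, a])
        s, u = s2, u2

print(Q[:, :3, :])  # learned values for small sojourn times
\end{verbatim}

Augmenting the state with the sojourn time is one simple way to make a tabular method sensitive to how long the process has stayed in a state; the paper's construction and its finite-horizon $Q$-learning scheme are developed in the sections that follow.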