Several real-world scenarios, such as remote control and sensing, involve delays in both actions and observations. Such delays degrade the performance of reinforcement learning (RL) algorithms, often to the point where they fail to learn anything substantial. This paper formally describes Markov decision processes (MDPs) with stochastic delays and shows that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with a significantly simplified cost structure. We use this equivalence to derive a model-free delay-resolved RL framework and show that even a simple RL algorithm built on this framework achieves near-optimal rewards in environments with stochastic delays in actions and observations. The resulting delay-resolved deep Q-network (DRDQN) algorithm is benchmarked on a variety of environments with multi-step and stochastic delays and outperforms currently established algorithms, both in achieving near-optimal rewards and in minimizing the computational overhead involved.
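To give intuition for the delayed-MDP-to-MDP transformation described above, here is a minimal sketch of the standard augmented-state construction for a fixed action delay: the agent's state is extended with the queue of pending (not-yet-executed) actions, which restores the Markov property. All class and method names (`DelayedEnvWrapper`, `ChainEnv`) are hypothetical illustrations, not the paper's implementation, and the sketch assumes a constant delay rather than the stochastic delays the paper handles.

```python
from collections import deque

class ChainEnv:
    """Toy 1-D chain: action 1 moves right, action 0 stays; reward 1 at the end."""
    def __init__(self, n=3):
        self.n = n
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, a):
        self.pos = min(self.n, self.pos + a)
        done = self.pos == self.n
        return self.pos, float(done), done

class DelayedEnvWrapper:
    """Actions take effect only after `delay` steps.

    The augmented state (obs, pending actions) is Markov even though
    the raw observation alone is not.
    """
    def __init__(self, env, delay, noop_action=0):
        self.env = env
        self.delay = delay
        self.noop = noop_action
    def reset(self):
        obs = self.env.reset()
        # Pre-fill the buffer with no-ops for the first `delay` steps.
        self.pending = deque([self.noop] * self.delay)
        return (obs, tuple(self.pending))
    def step(self, action):
        self.pending.append(action)
        executed = self.pending.popleft()  # action chosen `delay` steps ago
        obs, reward, done = self.env.step(executed)
        return (obs, tuple(self.pending)), reward, done
```

With `delay=2`, the first two chosen actions sit in the buffer while no-ops execute, so the agent's position only starts moving on the third step; a policy conditioned on the augmented state can account for this.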