The standard Markov Decision Process (MDP) formulation hinges on the assumption that an action is executed immediately after it is chosen. However, this assumption is often unrealistic and can lead to catastrophic failures in applications such as robotic manipulation, cloud computing, and finance. We introduce a framework for learning and planning in MDPs where the decision-maker commits actions that are executed with a delay of $m$ steps. The brute-force state-augmentation baseline, in which the state is concatenated with the last $m$ committed actions, suffers from complexity exponential in $m$, as we show for policy iteration. We then prove that with execution delay, deterministic Markov policies in the original state space are sufficient for attaining maximal reward, but need to be non-stationary. As for stationary Markov policies, we show they are sub-optimal in general. Consequently, we devise a non-stationary, Q-learning-style, model-based algorithm that solves delayed execution tasks without resorting to state augmentation. Experiments on tabular, physical, and Atari domains reveal that it converges quickly to high performance even for substantial delays, while standard approaches that either ignore the delay or rely on state augmentation struggle or fail due to divergence. The code is available at https://github.com/galdl/rl_delay_basic.git.
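To make the delayed-execution setting concrete, below is a minimal sketch (not the paper's implementation) of an environment wrapper with an $m$-step execution delay, together with the brute-force augmented state described above. The toy environment interface (`reset()`/`step()` returning `(state, reward, done)`), the class name, and the `default_action` filler used for the first $m$ steps are illustrative assumptions.

```python
from collections import deque

# Minimal sketch of an m-step execution-delay wrapper (illustrative only).
# The action committed at time t is executed at time t + m; until then the
# environment executes actions that were committed m steps earlier.
class DelayedExecutionWrapper:
    def __init__(self, env, m, default_action):
        self.env = env                      # assumed toy env with reset()/step()
        self.m = m                          # execution delay in steps
        self.default_action = default_action
        self.pending = deque()              # FIFO queue of committed actions

    def reset(self):
        # Pre-fill the queue so the first m executed actions are the default.
        self.pending = deque([self.default_action] * self.m)
        state = self.env.reset()
        return self._augmented(state)

    def step(self, action):
        # Execute the action committed m steps ago, then commit the new one.
        executed = self.pending.popleft()
        self.pending.append(action)
        next_state, reward, done = self.env.step(executed)
        return self._augmented(next_state), reward, done

    def _augmented(self, state):
        # Brute-force augmentation baseline: the effective state is the observed
        # state plus the m pending actions, so the augmented state space grows
        # exponentially with m.
        return (state, tuple(self.pending))
```

An agent that plans over these `(state, pending_actions)` pairs implements the augmentation baseline whose exponential blow-up motivates the paper; the proposed non-stationary approach instead acts on the original state space.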