One of the main goals of reinforcement learning (RL) is to provide a~way for physical machines to learn optimal behavior instead of being programmed. However, effective control of such machines usually requires fine time discretization. The most common RL methods apply independent random elements to each action, which is not suitable in that setting: it makes the controlled system jerk, and it does not ensure sufficient exploration, since a~single action is too short to create a~significant experience that could be translated into policy improvement. In our view these are the main obstacles preventing the application of RL in contemporary control systems. To address these pitfalls, in this paper we introduce an RL framework and adequate analytical tools for actions that may be stochastically dependent in subsequent time instants. We also introduce an RL algorithm that approximately optimizes a~policy producing such actions. It applies experience replay to adjust the likelihood of sequences of previous actions in order to optimize the expected $n$-step returns that the policy yields. The efficiency of this algorithm is verified against four other RL methods (CDAU, PPO, SAC, ACER) on four simulated learning control problems (Ant, HalfCheetah, Hopper, and Walker2D) under diverse time discretizations. The algorithm introduced here outperforms the competitors in most of the cases considered.
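To make the notion of actions that are stochastically dependent in subsequent time instants concrete, the minimal sketch below contrasts the usual i.i.d.\ Gaussian exploration noise with a~simple AR(1)-correlated noise process added to a~deterministic policy output. This is only an illustration under assumed names and parameters (\texttt{policy\_mean}, \texttt{sigma}, \texttt{alpha} are hypothetical), not the algorithm introduced in this paper.

\begin{verbatim}
# Minimal sketch (illustration only, not the paper's algorithm):
# i.i.d. exploration noise vs. temporally correlated AR(1) noise,
# which keeps consecutive actions stochastically dependent and
# avoids jerky control at fine time discretization.
import numpy as np

rng = np.random.default_rng(0)
action_dim = 6      # e.g., torques of a simulated walker (assumed)
sigma = 0.2         # exploration scale (assumed)
alpha = 0.9         # autocorrelation coefficient in (0, 1) (assumed)

def policy_mean(obs):
    """Stand-in for a learned deterministic policy head."""
    return np.zeros(action_dim)

# i.i.d. noise: each step's perturbation is independent.
def act_iid(obs):
    return policy_mean(obs) + sigma * rng.standard_normal(action_dim)

# AR(1) noise: xi_t = alpha*xi_{t-1} + sqrt(1-alpha^2)*eps_t, so the
# marginal variance stays sigma^2 while consecutive actions correlate.
xi = np.zeros(action_dim)
def act_correlated(obs):
    global xi
    xi = alpha * xi + np.sqrt(1.0 - alpha**2) \
         * rng.standard_normal(action_dim)
    return policy_mean(obs) + sigma * xi
\end{verbatim}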