A common setting of reinforcement learning (RL) is a Markov decision process (MDP) in which the environment is a stochastic discrete-time dynamical system. Whereas MDPs are suitable in such applications as video games or puzzles, physical systems evolve in continuous time. Continuous-time RL methods are known, but they have their limitations, such as the collapse of Q-learning. A general variant of RL is of a digital format, where updates of the value and policy are performed at discrete moments in time. The agent-environment loop then amounts to a sampled system, of which sample-and-hold is a special case. In this paper, we propose and benchmark two RL methods suitable for sampled systems. Specifically, we hybridize model-predictive control (MPC) with critics learning the Q-function and the value function. Optimality is analyzed, and performance is compared in an experimental case study with a mobile robot.