A common setting of reinforcement learning (RL) is a Markov decision process (MDP) in which the environment is a stochastic discrete-time dynamical system. Whereas MDPs are well suited to applications such as video games or puzzles, physical systems are continuous in time. A general variant of RL is digital, in which updates of the value (or cost) and the policy are performed at discrete moments in time. The agent-environment loop then amounts to a sampled system, of which sample-and-hold is a special case. In this paper, we propose and benchmark two RL methods suitable for sampled systems. Specifically, we hybridize model-predictive control (MPC) with critics that learn the optimal Q-function and value (or cost-to-go) function. Optimality is analyzed, and performance is compared in an experimental case study with a mobile robot.
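To illustrate the general idea of hybridizing MPC with a learned cost-to-go critic in a sampled (zero-order-hold) loop, the sketch below is a minimal, hypothetical example, not the paper's actual algorithm: it assumes a discretized double-integrator model, a quadratic critic V(x) = xᵀPx, random-shooting MPC, and a TD(0) critic update, all of which are illustrative choices introduced here.

```python
import numpy as np

# Minimal sketch (assumed setup): MPC whose horizon cost is augmented with a
# learned terminal cost-to-go V(x), with V updated by temporal differences on
# the sampled (sample-and-hold) transitions.

dt = 0.1                                 # sampling period of the digital loop
A = np.array([[1.0, dt], [0.0, 1.0]])    # discretized double integrator (assumed plant)
B = np.array([[0.0], [dt]])
Q = np.diag([1.0, 0.1])                  # stage-cost weights
R = 0.01
gamma = 0.99                             # discount per sample
horizon = 5

P = np.eye(2)                            # quadratic critic parameters, V(x) = x^T P x

def stage_cost(x, u):
    return float(x @ Q @ x + R * u * u)

def rollout_cost(x0, u_seq):
    """Discounted stage costs of an open-loop input sequence plus learned terminal value."""
    x, c, disc = x0.copy(), 0.0, 1.0
    for u in u_seq:
        c += disc * stage_cost(x, u)
        x = A @ x + (B * u).ravel()
        disc *= gamma
    return c + disc * float(x @ P @ x)   # critic supplies the terminal cost-to-go

def mpc_action(x0, n_samples=256, u_max=1.0):
    """Random-shooting MPC: sample input sequences, apply the first input of the best one."""
    best_u, best_c = 0.0, np.inf
    for _ in range(n_samples):
        u_seq = np.random.uniform(-u_max, u_max, size=horizon)
        c = rollout_cost(x0, u_seq)
        if c < best_c:
            best_c, best_u = c, u_seq[0]
    return best_u

def td_update(x, u, x_next, lr=1e-3):
    """One semi-gradient TD(0) step on the quadratic critic parameters P."""
    global P
    delta = stage_cost(x, u) + gamma * float(x_next @ P @ x_next) - float(x @ P @ x)
    P += lr * delta * np.outer(x, x)     # gradient of x^T P x with respect to P
    P = 0.5 * (P + P.T)                  # keep the critic symmetric

x = np.array([1.0, 0.0])
for k in range(200):
    u = mpc_action(x)                    # input held constant over one sampling interval
    x_next = A @ x + (B * u).ravel()
    td_update(x, u, x_next)
    x = x_next
```

In this sketch the MPC optimizer and the critic interact exactly once per sampling instant, which is the sampled-system structure discussed above; the specific plant, critic parameterization, and optimizer are placeholders.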