In this paper, we place deep Q-learning in a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe the learning process. We demonstrate the instability of learning and analyze the agent's behavior in the frequency domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: a gain-scheduled state-feedback H2 controller, a dynamic Hinf controller, and a constant-gain Hinf controller. Setting up the learning agent with a control-oriented tuning methodology is more transparent and rests on better-established literature than the heuristics common in reinforcement learning. In addition, our approach uses neither a target network nor a randomized replay memory: the role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the Hinf-controlled learning performs slightly better than Double deep Q-learning.
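For concreteness, the following is a minimal sketch of how such a model can arise, assuming the standard NTK linearization of a wide Q-network and semi-gradient TD updates on a fixed batch of transitions whose successor values can be expressed through a selection matrix; the symbols below ($q_k$, $K$, $P_k$, $u_k$, $F$) are introduced here only for illustration and are not taken from the abstract.

Let $q_k \in \mathbb{R}^n$ collect the Q-values of the network at the $n$ sampled state-action pairs after the $k$-th update. In the NTK regime the network output is approximately linear in its parameters, so a semi-gradient TD update with learning rate $\alpha$, discount factor $\gamma$, reward vector $r$, NTK Gram matrix $K$, bootstrapping matrix $P_k$ (selecting the successor values under the current greedy policy), and an additive control input $u_k$ acting as a dynamical reward reads
\[
q_{k+1} \;=\; \underbrace{\bigl(I - \alpha K (I - \gamma P_k)\bigr)}_{A_k}\, q_k \;+\; \underbrace{\alpha K}_{B}\,\bigl(r + u_k\bigr),
\]
i.e. a linear system $q_{k+1} = A_k q_k + B(r + u_k)$ whose state matrix is uncertain because the greedy policy changes $P_k$ and the kernel $K$ drifts from its initial value during training. If some $A_k$ has eigenvalues outside the unit circle, the uncontrolled iteration ($u_k = 0$) can diverge; a robust controller, for instance a constant gain $u_k = -F q_k$, is then chosen so that $A_k - BF$ remains stable for all admissible $P_k$ and $K$, which is the kind of uncertainty the H2 and Hinf syntheses mentioned above are meant to address.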