In this paper, we place deep Q-learning in a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe learning. We demonstrate the instability of learning and analyze the agent's behavior in the frequency domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: a state-feedback gain-scheduled $\mathcal{H}_2$ controller, a dynamic $\mathcal{H}_\infty$ controller, and a constant-gain $\mathcal{H}_\infty$ controller. Setting up the learning agent with a control-oriented tuning methodology is more transparent and builds on well-established literature, in contrast to the heuristics used in reinforcement learning. In addition, our approach uses neither a target network nor a randomized replay memory. The role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the $\mathcal{H}_\infty$-controlled learning performs slightly better than Double deep Q-learning.
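As a rough sketch of the modeling step (illustrative only, not the paper's exact notation), linearizing the Q-network around its initialization via the neural tangent kernel lets semi-gradient temporal-difference updates on a batch of transitions be written as a discrete-time linear system in the vector of Q-values $q_k$; the symbols $H$, $P$, $B$, and $u_k$ below are assumptions introduced for illustration:
\[
q_{k+1} = \bigl(I - \alpha H (I - \gamma P)\bigr)\, q_k + \alpha H r + B u_k,
\]
where $\alpha$ is the learning rate, $H$ the (uncertain) NTK Gram matrix, $\gamma$ the discount factor, $P$ an uncertain matrix capturing the greedy action selection, $r$ the reward vector, and $u_k$ the control input injected as a dynamical reward. Robust $\mathcal{H}_2$/$\mathcal{H}_\infty$ synthesis can then account for the uncertainty in $H$ and $P$ when enforcing convergence.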