This paper proposes a differentiable robust LQR layer for reinforcement learning and imitation learning under model uncertainty and stochastic dynamics. The robust LQR layer can exploit the advantages of both robust optimal control and model-free learning, providing a new type of inductive bias for modeling stochasticity and uncertainty in control systems. In particular, we propose an efficient way to differentiate through a robust LQR optimization program by rewriting it as a convex program (i.e., a semidefinite program) over the worst-case cost. Building on recent work on embedding convex optimization inside neural network layers, we develop a fully differentiable layer for optimizing this worst-case cost, i.e., we compute the derivative of a performance measure with respect to the model's unknown parameters and its uncertainty and stochasticity parameters. We demonstrate the proposed method on imitation learning and approximate dynamic programming in stochastic and uncertain domains. The experimental results show that the proposed method can optimize robust policies under uncertainty and achieves significantly better performance than existing methods that do not model uncertainty directly.
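To make the differentiable-layer pattern concrete, the sketch below (not the paper's code) shows how a control problem written as a parametrized convex program can be embedded as a differentiable layer using the cvxpylayers library. The dimensions, the one-step certainty-equivalent cost, and all variable names are illustrative assumptions; the paper's layer instead minimizes the worst-case LQR cost reformulated as a semidefinite program, but the differentiation mechanism through the convex program follows the same pattern.

```python
import cvxpy as cp
import numpy as np
import torch
from cvxpylayers.torch import CvxpyLayer

n, m = 4, 2                      # state / input dimensions (hypothetical)
Q_sqrt = np.eye(n)               # fixed state-cost weight (square root), for the sketch
R_sqrt = 0.1 * np.eye(m)         # fixed input-cost weight (square root)
x0 = np.ones(n)                  # current state, held constant here

A = cp.Parameter((n, n))         # learnable nominal dynamics matrices
B = cp.Parameter((n, m))
u = cp.Variable(m)               # control input solved for by the layer

# One-step certainty-equivalent quadratic cost with a box-like input constraint.
# The paper's layer optimizes a worst-case cost via an SDP instead of this cost.
cost = cp.sum_squares(Q_sqrt @ (A @ x0 + B @ u)) + cp.sum_squares(R_sqrt @ u)
prob = cp.Problem(cp.Minimize(cost), [cp.norm(u, "inf") <= 1.0])
assert prob.is_dpp()             # required for differentiable canonicalization

layer = CvxpyLayer(prob, parameters=[A, B], variables=[u])

A_t = torch.randn(n, n, requires_grad=True)
B_t = torch.randn(n, m, requires_grad=True)
u_star, = layer(A_t, B_t)        # differentiable argmin of the convex program
u_star.sum().backward()          # gradients of a scalar loss w.r.t. the model parameters
print(A_t.grad.shape, B_t.grad.shape)
```

In a learning loop, the loss placed on `u_star` would come from an imitation or approximate dynamic programming objective, and the gradients flowing into `A_t` and `B_t` (and, in the paper's setting, into uncertainty and stochasticity parameters) update the model end to end.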