This paper presents a new approach to the fundamental problem of learning optimal Q-functions. In this approach, optimal Q-functions are formulated as saddle points of a nonlinear Lagrangian function derived from the classic Bellman optimality equation. The paper shows that the Lagrangian enjoys strong duality despite its nonlinearity, which paves the way for a general Lagrangian method for Q-function learning. As a demonstration, the paper develops an imitation learning algorithm based on this duality theory and applies it to a state-of-the-art machine translation benchmark. The paper then demonstrates a symmetry-breaking phenomenon in the optimality of the Lagrangian saddle points, which justifies a largely overlooked direction in developing the Lagrangian method.