In this paper, we consider the linear programming (LP) formulation of deep reinforcement learning. The number of constraints depends on the sizes of the state and action spaces, which makes the problem intractable in large or continuous environments. The general augmented Lagrangian method suffers from the double-sampling obstacle when solving the LP: the conditional expectations arising from the constraint functions and the quadratic penalties in the augmented Lagrangian function make sampling and evaluation difficult. Motivated by the multiplier updates, we overcome this obstacle in minimizing the augmented Lagrangian function by replacing the intractable conditional expectations with the multipliers, and a deep parameterized augmented Lagrangian method is proposed. Furthermore, the replacement provides a promising way to merge the two steps of the augmented Lagrangian method into a single constrained problem. A general theoretical analysis shows that the solutions generated from a sequence of such constrained optimizations converge to the optimal solution of the LP if the errors are controlled properly. A theoretical analysis of the quadratic-penalty algorithm in the neural tangent kernel setting shows that the residual can be made arbitrarily small if the network and optimization parameters are chosen suitably. Preliminary experiments illustrate that our method is competitive with other state-of-the-art algorithms.
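To make the double-sampling obstacle concrete, the following schematic uses standard LP-for-MDP notation; the exact formulation and symbols in the body of the paper may differ, so this should be read as an illustrative sketch rather than the paper's precise setup. The LP over the value function is
\[
\min_{V}\ \sum_{s}\mu(s)\,V(s)
\quad\text{s.t.}\quad
c(s,a) := r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\!\left[V(s')\right] - V(s) \le 0 \quad \forall (s,a).
\]
An augmented Lagrangian term of the form
\[
\lambda(s,a)\,c(s,a) + \frac{\rho}{2}\,\big[c(s,a)\big]_{+}^{2}
\]
contains the square of the conditional expectation $\mathbb{E}_{s'}[V(s')]$, and an unbiased sample estimate of that square requires two independent next-state draws from the same $(s,a)$; this is the double-sampling obstacle. The multiplier update $\lambda(s,a) \leftarrow \big[\lambda(s,a) + \rho\, c(s,a)\big]_{+}$ ties the multiplier to the constraint value, which is the kind of relation that motivates replacing the intractable conditional expectation inside the penalty with the (parameterized) multiplier, as described in the abstract.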