Two-level stochastic optimization formulations have become instrumental in a number of machine learning contexts such as neural architecture search, continual learning, adversarial learning, and hyperparameter tuning. Practical stochastic bilevel optimization problems become challenging in optimization or learning scenarios where the number of variables is high or where constraints are present. The goal of this paper is twofold. First, we aim at promoting the use of bilevel optimization in large-scale learning, and we introduce a practical bilevel stochastic gradient method (BSG-1) that requires neither lower level second-order derivatives nor linear system solves (and avoids matrix-vector products altogether). Our BSG-1 method is close to first-order principles, which allows it to achieve a performance better than that of methods which are not, such as DARTS. Second, we develop bilevel stochastic gradient descent for bilevel problems with lower level constraints, and we introduce a convergence theory that covers the unconstrained and constrained cases and abstracts as much as possible from the specifics of the bilevel gradient calculation.
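For concreteness, recall the standard adjoint (hyper)gradient of the upper level objective in the unconstrained case, written here with illustrative notation not fixed in the abstract: $f$ denotes the upper level objective, $g$ the lower level objective, and $y(x)$ the lower level solution. BSG-1 is designed precisely to avoid the second-order derivatives of $g$ and the linear system solve appearing in this formula.
\[
  \nabla f\bigl(x, y(x)\bigr)
  \;=\; \nabla_x f(x, y)
  \;-\; \nabla^2_{xy} g(x, y)\,\bigl[\nabla^2_{yy} g(x, y)\bigr]^{-1} \nabla_y f(x, y),
  \qquad y = y(x).
\]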