Deep neural networks (DNNs) have achieved state-of-the-art performance across a variety of traditional machine learning tasks, e.g., speech recognition, image classification, and segmentation. The ability of DNNs to efficiently approximate high-dimensional functions has also motivated their use in scientific applications, e.g., to solve partial differential equations (PDEs) and to generate surrogate models. In this paper, we consider the supervised training of DNNs, which arises in many of the above applications. We focus on the central problem of optimizing the weights of the given DNN such that it accurately approximates the relation between observed input and target data. Devising effective solvers for this optimization problem is notoriously challenging due to the large number of weights, non-convexity, data sparsity, and the non-trivial choice of hyperparameters. To solve the optimization problem more efficiently, we propose the use of variable projection (VarPro), a method originally designed for separable nonlinear least-squares problems. Our main contribution is the Gauss-Newton VarPro method (GNvpro), which extends the reach of the VarPro idea to non-quadratic objective functions, most notably the cross-entropy loss functions arising in classification. These extensions make GNvpro applicable to all training problems that involve a DNN whose last layer is an affine mapping, which is common in many state-of-the-art architectures. In our four numerical experiments from surrogate modeling, segmentation, and classification, GNvpro solves the optimization problem more efficiently than commonly used stochastic gradient descent (SGD) schemes. Moreover, GNvpro finds solutions that generalize well to unseen data points, in all but one example better than well-tuned SGD methods.
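The core VarPro idea referenced above can be illustrated on a classical separable nonlinear least-squares problem: the linear coefficients are eliminated in closed form via a linear least-squares solve, and the outer optimizer works only on the nonlinear parameters. The following is a minimal sketch on a hypothetical toy model (the function names and the exponential-plus-constant model are illustrative, not the paper's); the paper's GNvpro extends this elimination step beyond the quadratic loss shown here.

```python
import numpy as np
from scipy.optimize import least_squares

# Toy separable model: y ≈ w1 * exp(-a * t) + w2,
# linear in the weights (w1, w2), nonlinear in the parameter a.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 50)
y = 2.5 * np.exp(-1.3 * t) + 0.7 + 0.01 * rng.standard_normal(t.size)

def basis(a):
    # Columns are the basis functions; the linear weights multiply these.
    return np.column_stack([np.exp(-a * t), np.ones_like(t)])

def projected_residual(theta):
    # VarPro step: for the current nonlinear parameter, eliminate the
    # linear weights by solving a linear least-squares problem, then
    # return the residual of the reduced (projected) problem.
    A = basis(theta[0])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ w - y

# Optimize only the nonlinear parameter; the weights are recovered afterwards.
sol = least_squares(projected_residual, x0=[0.5])
a_hat = sol.x[0]
w_hat, *_ = np.linalg.lstsq(basis(a_hat), y, rcond=None)
```

The analogy to DNN training: when the last layer is an affine mapping, its weights play the role of `w1, w2` and can be eliminated for each setting of the remaining (nonlinear) network weights, shrinking the outer optimization problem.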