We propose a novel low-rank initialization framework for training low-rank deep neural networks -- networks where the weight parameters are re-parameterized as products of two low-rank matrices. The most successful existing approach, spectral initialization, draws a sample from the initialization distribution for the full-rank setting and then optimally approximates the full-rank initialization parameters in the Frobenius norm with a pair of low-rank initialization matrices via singular value decomposition. Our method is instead inspired by the insight that approximating the function computed by each layer is more important than approximating its parameter values. We prove that there is a significant gap between these two approaches for ReLU networks, and that the gap grows as the desired rank of the approximating weights decreases, or as the dimension of the inputs to the layer increases (the latter holds when the network width is super-linear in the dimension). Along the way, we provide the first provably efficient algorithm for solving the ReLU low-rank approximation problem when the target rank $r$ is treated as a fixed parameter -- previously, the problem was not known to be computationally tractable even for rank $1$. We also provide a practical algorithm for this problem that is no more expensive than the existing spectral initialization approach, and we validate our theory by training ResNet and EfficientNet models (He et al., 2016; Tan & Le, 2019) on ImageNet (Russakovsky et al., 2015).
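To make the contrast concrete, the following is a minimal NumPy sketch (not the paper's algorithm) of rank-$r$ spectral initialization via truncated SVD, together with the two error measures the abstract distinguishes: the Frobenius (parameter-space) error that spectral initialization minimizes, and the function-space error of the resulting ReLU layer, which the abstract argues is the more relevant quantity. All shapes, names, and the Gaussian full-rank initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_init(W0, r):
    """Rank-r spectral initialization: best Frobenius-norm approximation
    of the full-rank init W0 via truncated SVD (Eckart-Young), split into
    two low-rank factors A @ B."""
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    A = U[:, :r] * np.sqrt(s[:r])          # shape (m, r)
    B = np.sqrt(s[:r])[:, None] * Vt[:r]   # shape (r, n)
    return A, B

# Hypothetical layer: full-rank Gaussian init, factorized to rank 32.
m, n, r = 256, 512, 32
W0 = rng.normal(scale=1.0 / np.sqrt(n), size=(m, n))
A, B = spectral_init(W0, r)

# Parameter-space error -- the quantity spectral initialization minimizes.
param_err = np.linalg.norm(A @ B - W0)

# Function-space error of the ReLU layer on random inputs -- a proxy for
# how well the low-rank layer approximates the full-rank layer's function.
relu = lambda Z: np.maximum(Z, 0.0)
X = rng.normal(size=(n, 1000))
func_err = np.linalg.norm(relu(A @ B @ X) - relu(W0 @ X)) / np.sqrt(X.shape[1])
print(f"Frobenius error: {param_err:.3f}, ReLU output error: {func_err:.3f}")
```

The sketch only evaluates both objectives; the ReLU low-rank approximation problem discussed in the abstract is the task of choosing the factors to minimize the function-space error directly, rather than the Frobenius error.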