For deploying deep learning models to lower-end devices, it is necessary to train less resource-demanding variants of state-of-the-art architectures. This does not eliminate the need for the more expensive models, as they achieve higher performance. To avoid training two separate models, we show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance. We extend prior methods that focused only on core networks of smaller width, whereas we support core networks of arbitrary architecture. Our proposed training scheme alternates between optimizing only the core part of the network and optimizing the full network. The accuracy of the full model remains comparable, while the core network achieves better performance than when it is trained in isolation. In particular, we show that training a Transformer with a low-rank core yields a low-rank model with better performance than training the low-rank model alone. We analyze our training scheme theoretically and show its convergence under assumptions that are either standard or practically justified. Moreover, we show that the developed theoretical framework allows the analysis of many other partial training schemes for neural networks.
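To make the alternating scheme concrete, the following is a minimal sketch, not the paper's implementation: it assumes a width-based core obtained by slicing the first `core_width` hidden units of a small MLP in PyTorch, and the names `CoreSliceMLP`, `core_width`, and `alternate_step` are hypothetical.

```python
# Minimal sketch of alternating between core-only and full-network optimization.
# Assumption: the core is a width slice of the full network, so its parameters
# are a subset (slice) of the full model's parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoreSliceMLP(nn.Module):
    def __init__(self, d_in=20, d_hidden=64, d_out=10, core_width=16):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)
        self.core_width = core_width

    def forward(self, x, core_only=False):
        if core_only:
            # Core subnetwork: only the first `core_width` hidden units
            # (and the weight slices feeding into / out of them) are used.
            h = F.relu(F.linear(x, self.fc1.weight[: self.core_width],
                                self.fc1.bias[: self.core_width]))
            return F.linear(h, self.fc2.weight[:, : self.core_width], self.fc2.bias)
        return self.fc2(F.relu(self.fc1(x)))

model = CoreSliceMLP()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def alternate_step(x, y, step):
    # Even steps optimize only the core forward pass, odd steps the full model.
    # Since the core parameters are slices of the full parameters, the core loss
    # produces gradients only for the shared "core" entries.
    opt.zero_grad()
    logits = model(x, core_only=(step % 2 == 0))
    loss = F.cross_entropy(logits, y)
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage on random data.
x, y = torch.randn(32, 20), torch.randint(0, 10, (32,))
for step in range(4):
    alternate_step(x, y, step)
```

For a low-rank core as in the Transformer experiment, the same alternation applies with the core forward pass using a low-rank factorization of each weight matrix instead of a width slice.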