To deploy a deep learning model in production, it needs to be both accurate and compact to meet latency and memory constraints. This usually results in a network that is deep (to ensure performance) and yet thin (to improve computational efficiency). In this paper, we propose an efficient method to train a deep thin network with a theoretical guarantee. Our method is motivated by model compression and consists of three stages. In the first stage, we sufficiently widen the deep thin network and train it until convergence. In the second stage, we use this well-trained deep wide network to warm up (or initialize) the original deep thin network. This is achieved by letting the thin network imitate the intermediate outputs of the wide network from layer to layer. In the last stage, we further fine-tune this well-initialized deep thin network. The theoretical guarantee is established using mean field analysis, which shows the advantage of layerwise imitation over traditionally training deep thin networks from scratch by backpropagation. We also conduct large-scale empirical experiments to validate our approach. Trained with our method, ResNet50 outperforms ResNet101, and BERT_BASE becomes comparable with BERT_LARGE, where both of the latter models are trained via the standard training procedures as in the literature.
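To make the three-stage procedure concrete, here is a minimal sketch in PyTorch of the second stage, the layerwise imitation warm-up. All names (ThinNet, layerwise_warmup, the per-layer projection maps) are illustrative assumptions for this sketch, not the authors' released code; in particular, how the wide and thin hidden widths are matched, and whether layers are imitated jointly or one at a time, is left open here.

```python
# Sketch of stage 2 (layerwise imitation warm-up) under assumed PyTorch APIs.
# ThinNet / layerwise_warmup / the projection maps are hypothetical names.
import torch
import torch.nn as nn

class ThinNet(nn.Module):
    """A toy deep network: a stack of fully connected blocks of a given width."""
    def __init__(self, depth=4, width=64, in_dim=128, out_dim=10):
        super().__init__()
        dims = [in_dim] + [width] * depth
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
            for i in range(depth)
        )
        self.head = nn.Linear(width, out_dim)

    def forward(self, x, return_hidden=False):
        hidden = []
        for blk in self.blocks:
            x = blk(x)
            hidden.append(x)
        out = self.head(x)
        return (out, hidden) if return_hidden else out

# Stage 1 (assumed done elsewhere): build a widened copy of the thin network,
# e.g. wide = ThinNet(width=256), and train it to convergence on the task loss.

def layerwise_warmup(thin, wide, project, loader, epochs=1, lr=1e-3):
    """Stage 2: let the thin net imitate the wide net's intermediate outputs.

    `project` is a list of linear maps (one per layer) taking wide hidden
    states down to the thin width so the per-layer MSE is well defined.
    """
    wide.eval()
    params = list(thin.parameters()) + [p for m in project for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for x, _ in loader:
            with torch.no_grad():
                _, wide_hidden = wide(x, return_hidden=True)
            _, thin_hidden = thin(x, return_hidden=True)
            # Sum the layer-by-layer imitation losses.
            loss = sum(mse(h_t, proj(h_w))
                       for h_t, h_w, proj in zip(thin_hidden, wide_hidden, project))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 3 (not shown): fine-tune the warmed-up thin network end to end with
# the usual task loss (e.g. cross-entropy), exactly as in ordinary training.
```

The point of the sketch is only the structure of the warm-up: the wide network is frozen, the thin network (plus the dimension-matching projections) is optimized to reproduce the wide network's hidden states layer by layer, and only afterwards is the thin network fine-tuned on the task itself.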