Compressing deep neural networks while maintaining accuracy is important when we want to deploy large, powerful models in production and/or on edge devices. One common technique used to achieve this goal is knowledge distillation. Typically, the output of a static, pre-defined teacher (a large base network) is used as soft labels to train and transfer information to a student (smaller) network. In this paper, we introduce Adjoined Networks, or AN, a learning paradigm that trains both the original base network and the smaller compressed network together. In our training approach, the parameters of the smaller network are shared across both the base and the compressed networks. Using our training paradigm, we can simultaneously compress (the student network) and regularize (the teacher network) any architecture. In this paper, we focus on popular CNN-based architectures used for computer vision tasks. We conduct an extensive experimental evaluation of our training paradigm on several large-scale datasets. Using ResNet-50 as the base network, AN achieves 71.8% top-1 accuracy with only 1.8M parameters and 1.6 GFLOPs on the ImageNet dataset. We further propose Differentiable Adjoined Networks (DAN), a training paradigm that augments AN by using neural architecture search to jointly learn both the width and the weights of each layer of the smaller network. DAN achieves ResNet-50-level accuracy on ImageNet with $3.8\times$ fewer parameters and $2.2\times$ fewer FLOPs.
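To make the weight-sharing idea concrete, below is a minimal sketch of one way an adjoined layer and joint objective could look in PyTorch. It assumes slimmable-style sharing, where the small path reuses the first channel-slice of each base layer's filters, and a joint loss in which both paths fit the labels while the small path additionally matches the base path's soft outputs. The names `AdjoinedConv`, `adjoined_loss`, `small_out_ch`, and the weighting `alpha` are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only; not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjoinedConv(nn.Module):
    """Conv layer whose 'small' path reuses a slice of the 'base' path's weights."""
    def __init__(self, in_ch, out_ch, small_out_ch, k=3):
        super().__init__()
        self.base = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.small_out_ch = small_out_ch  # width of the compressed path (assumed fixed here)

    def forward(self, x_base, x_small):
        y_base = self.base(x_base)
        # Shared parameters: the small path uses the first `small_out_ch` filters,
        # restricted to the channels present in the small input.
        w = self.base.weight[: self.small_out_ch, : x_small.shape[1]]
        b = self.base.bias[: self.small_out_ch]
        y_small = F.conv2d(x_small, w, b, padding=self.base.padding)
        return y_base, y_small

def adjoined_loss(logits_base, logits_small, target, alpha=0.5):
    """Joint objective: both paths fit the labels; the small path also matches the base."""
    ce_base = F.cross_entropy(logits_base, target)
    ce_small = F.cross_entropy(logits_small, target)
    kd = F.kl_div(F.log_softmax(logits_small, dim=1),
                  F.softmax(logits_base.detach(), dim=1),
                  reduction="batchmean")
    return ce_base + ce_small + alpha * kd
```

Because the small path's filters are a literal slice of the base path's filters, a single backward pass through `adjoined_loss` updates one shared set of parameters, which is what lets the base network act as a regularizer for the compressed network (and vice versa) during joint training.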