We introduce latency-aware network acceleration (LANA), an approach that builds on neural architecture search (NAS) techniques and teacher-student distillation to accelerate neural networks. LANA consists of two phases: in the first phase, it trains many alternative operations for every layer of the teacher network using layer-wise feature-map distillation. In the second phase, it solves the combinatorial selection of efficient operations with a novel constrained integer linear optimization (ILP) approach. The ILP formulation brings unique properties: it (i) performs NAS within a few seconds to minutes, (ii) easily satisfies budget constraints, (iii) operates at layer granularity, and (iv) supports a huge search space of $O(10^{100})$ candidates, surpassing prior search approaches in both efficacy and efficiency. In extensive experiments, we show that LANA yields efficient and accurate models constrained by a target latency budget, while being significantly faster than other techniques. We analyze three popular network architectures, EfficientNetV1, EfficientNetV2, and ResNeST, and achieve accuracy improvements for all models (up to $3.0\%$) when compressing larger models to the latency level of smaller models. LANA achieves significant speed-ups (up to $5\times$) with minor to no accuracy drop on GPU and CPU. The code will be shared soon.
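To make the second phase concrete, the snippet below sketches the kind of per-layer operation-selection ILP the abstract describes: pick exactly one candidate operation per layer, minimize an accuracy-loss proxy, and respect a latency budget. It is a minimal illustration under assumed inputs, not LANA's actual formulation; the toy `cost` and `latency` tables, the budget value, and the use of the PuLP solver are hypothetical choices for exposition.

```python
# Minimal sketch of a budget-constrained, per-layer operation-selection ILP.
# Assumptions (not from the paper): cost[l][k] is a per-layer accuracy-loss
# proxy for replacing teacher layer l with candidate operation k (index 0 is
# the original teacher op, cost 0), latency[l][k] is its measured latency.
import pulp

L, K = 4, 3                      # layers and candidate operations per layer (toy sizes)
cost = [[0.0, 0.2, 0.5],
        [0.0, 0.1, 0.3],
        [0.0, 0.4, 0.6],
        [0.0, 0.1, 0.2]]
latency = [[5.0, 2.0, 1.0],      # latency in ms
           [4.0, 2.5, 1.5],
           [6.0, 3.0, 1.0],
           [3.0, 1.5, 1.0]]
budget = 10.0                    # target latency budget in ms

prob = pulp.LpProblem("layer_op_selection", pulp.LpMinimize)
# Binary variable x[l][k] = 1 if candidate operation k is chosen for layer l.
x = pulp.LpVariable.dicts("x", (range(L), range(K)), cat="Binary")

# Objective: minimize the total accuracy-loss proxy of the selected operations.
prob += pulp.lpSum(cost[l][k] * x[l][k] for l in range(L) for k in range(K))
# Exactly one operation must be selected for each layer.
for l in range(L):
    prob += pulp.lpSum(x[l][k] for k in range(K)) == 1
# The summed latency of the selected operations must satisfy the budget.
prob += pulp.lpSum(latency[l][k] * x[l][k] for l in range(L) for k in range(K)) <= budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selection = [max(range(K), key=lambda k: pulp.value(x[l][k])) for l in range(L)]
print("chosen operation per layer:", selection)
```

Because the decision variables and constraints are linear, an off-the-shelf ILP solver handles even very large per-layer search spaces quickly, which is what makes the seconds-to-minutes search time and easy budget handling claimed above plausible.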