Quantization is widely employed in both cloud and edge systems to reduce the memory footprint, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains with limited accuracy drops, especially when the bit-width assignment is optimized by automated Neural Architecture Search (NAS) tools. State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weight and activation tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs model size and accuracy vs energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce the memory and energy for inference by up to 63% and 27%, respectively, compared to a layer-wise approach at the same accuracy.
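To make the difference between layer-wise and channel-wise bit-width assignment concrete, the following is a minimal sketch and not the NAS described in this work. It assumes symmetric uniform quantization with one scale per output channel and takes the bit-width assignment as a given input rather than searching for it. The function names `quantize_per_channel` and `weight_bits`, as well as the example bit-widths, are hypothetical and chosen only for illustration.

```python
import numpy as np


def quantize_per_channel(weights, bits_per_channel):
    """Symmetric uniform fake-quantization with one bit-width per output channel.

    weights: array of shape (out_channels, ...).
    bits_per_channel: one integer bit-width (>= 2) per output channel.
    Returns the de-quantized weights, with the same shape as the input.
    """
    weights = np.asarray(weights, dtype=np.float64)
    bits = np.asarray(bits_per_channel, dtype=np.int64)
    if bits.shape != (weights.shape[0],):
        raise ValueError("need exactly one bit-width per output channel")
    if np.any(bits < 2):
        raise ValueError("this sketch assumes bit-widths of at least 2")

    flat = weights.reshape(weights.shape[0], -1)
    # Largest positive integer level for each channel, e.g. 127 for 8 bits.
    qmax = (2.0 ** (bits - 1) - 1.0)[:, None]
    max_abs = np.max(np.abs(flat), axis=1, keepdims=True)
    # One scale per channel; all-zero channels get a dummy scale of 1.
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(flat / scale), -qmax, qmax)
    return (q * scale).reshape(weights.shape)


def weight_bits(weights_shape, bits_per_channel):
    """Total storage, in bits, of a weight tensor under the given assignment."""
    weights_per_channel = int(np.prod(weights_shape[1:]))
    return int(weights_per_channel * np.sum(bits_per_channel))


# Example: a convolution with 4 output channels and 3x3x3 kernels.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 3, 3))

layer_wise = [8, 8, 8, 8]    # a single precision for the whole layer
channel_wise = [8, 2, 2, 2]  # high precision for one channel only (illustrative)

w_q = quantize_per_channel(w, channel_wise)
print(weight_bits(w.shape, layer_wise))    # 864 bits
print(weight_bits(w.shape, channel_wise))  # 378 bits
```

In this toy assignment only the first channel keeps 8-bit weights, so the tensor shrinks while that channel retains its precision. Which channels deserve the higher precision is what a search procedure would have to decide; the assignment above is arbitrary.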