Quantization is a technique for reducing the training and inference times of deep neural networks (DNNs), which is crucial for training in resource-constrained environments or for applications in which inference is time-critical. State-of-the-art (SOTA) quantization approaches focus on post-training quantization, i.e., quantizing pre-trained DNNs to speed up inference. While work on quantized training exists, most approaches require a refinement phase in full precision (usually single precision) at the end of training or enforce a global word length across the entire DNN. This leads to suboptimal assignments of bit-widths to layers and, consequently, suboptimal resource usage. To overcome these limitations, we introduce AdaPT, a new fixed-point quantized sparsifying training strategy. AdaPT decides on precision switches between training epochs based on information-theoretic conditions. The goal is to determine, on a per-layer basis, the lowest precision that causes no quantization-induced information loss while keeping the precision high enough that future learning steps do not suffer from vanishing gradients. The benefits of the resulting fully quantized DNN are evaluated with an analytical performance model that we develop. We show that, for AlexNet/ResNet on CIFAR10/100, AdaPT achieves an average training speedup of 1.27x over standard float32 training together with an average accuracy increase of 0.98%, and we further demonstrate that AdaPT-trained models achieve an average inference speedup of 2.33x with a model size reduction factor of 0.52.
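The per-layer precision-switching idea is only summarized at a high level above; the following minimal sketch (Python/NumPy) illustrates what such an epoch-wise controller could look like. It is an assumption-laden illustration, not AdaPT's actual method: the names `quantize_fixed_point` and `PrecisionController` are hypothetical, and the paper's information-theoretic conditions are stood in for by a simple relative quantization-error threshold plus a gradient-norm floor that guards against vanishing gradients.

```python
# Minimal sketch (not the paper's exact criterion): per-layer fixed-point
# precision switching between training epochs. All names and thresholds
# here are hypothetical placeholders.
import numpy as np


def quantize_fixed_point(x, word_length, frac_bits):
    """Round x onto a signed fixed-point grid with the given word length."""
    scale = 2.0 ** frac_bits
    max_q = 2.0 ** (word_length - 1) - 1
    return np.clip(np.round(x * scale), -max_q, max_q) / scale


class PrecisionController:
    """Chooses a per-layer word length after each epoch."""

    def __init__(self, word_lengths, min_bits=4, max_bits=16,
                 loss_tol=1e-3, grad_floor=1e-6):
        self.word_lengths = dict(word_lengths)  # layer name -> current bits
        self.min_bits, self.max_bits = min_bits, max_bits
        self.loss_tol = loss_tol      # tolerated relative quantization error
        self.grad_floor = grad_floor  # guard against vanishing gradients

    def update(self, layer, weights, grad_norm, frac_bits):
        bits = self.word_lengths[layer]
        # Try one bit less: accept only if the induced error stays negligible,
        # i.e., the switch causes (approximately) no information loss.
        if bits > self.min_bits:
            q = quantize_fixed_point(weights, bits - 1, frac_bits)
            rel_err = (np.linalg.norm(weights - q)
                       / (np.linalg.norm(weights) + 1e-12))
            if rel_err < self.loss_tol:
                bits -= 1
        # If gradients are already tiny, add precision so that future
        # learning steps do not stall.
        if grad_norm < self.grad_floor and bits < self.max_bits:
            bits += 1
        self.word_lengths[layer] = bits
        return bits
```

A controller like this would be queried once per epoch for every layer, with the returned word length used to re-quantize that layer's weights and activations before the next epoch begins.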