We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference. We leverage weight normalization as a means of constraining parameters during training using accumulator bit width bounds that we derive. We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline. We then show that this reduction translates to increased design efficiency for custom FPGA-based accelerators. Finally, we show that our algorithm not only constrains weights to fit into an accumulator of user-defined bit width, but also increases the sparsity and compressibility of the resulting weights. Across all of our benchmark models trained with 8-bit weights and activations, we observe that constraining the hidden layers of quantized neural networks to fit into 16-bit accumulators yields an average 98.2% sparsity with an estimated compression rate of 46.5x, all while maintaining 99.2% of the floating-point performance.
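As an illustrative sketch of the kind of accumulator bit width bound referred to above (the notation $w$, $x_k$, $M$, $P$ is ours for illustration and not necessarily the exact form derived in the paper): consider a dot product $y = \sum_{k=1}^{K} w_k x_k$ with quantized weights $w$ and unsigned $M$-bit activations $x_k \in [0, 2^M - 1]$. Since $|y| \le \|w\|_1 \, (2^M - 1)$, the result is guaranteed to fit in a signed $P$-bit accumulator, whose representable range is $[-2^{P-1}, 2^{P-1} - 1]$, whenever
$$\|w\|_1 \le \frac{2^{P-1} - 1}{2^M - 1},$$
an $\ell_1$-norm constraint of the sort that weight normalization can enforce on the parameters during training.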