Quantization is a technique for reducing the computation and memory costs of DNN models, which are growing increasingly large. Existing quantization solutions use fixed-point integer or floating-point types, which offer limited benefits because both require more bits to maintain the accuracy of the original models. Variable-length quantization, on the other hand, uses low-bit quantization for normal values and high precision for a fraction of outlier values. Although this line of work brings algorithmic benefits, it also introduces significant hardware overheads due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads. ANT leverages two key innovations to exploit the intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a particular data type, flint, that combines the advantages of float and int to adapt to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing element architecture for ANT and show its ease of integration with existing DNN accelerators. Our design achieves a 2.8$\times$ speedup and a 2.5$\times$ energy efficiency improvement over state-of-the-art quantization accelerators.
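To make the idea of per-tensor type selection concrete, the following is a minimal sketch and not the ANT implementation. The abstract does not give the selection criterion or the flint encoding, so the sketch assumes mean squared error (MSE) as the criterion and uses a plain power-of-two grid as a stand-in for a float-like type; flint itself is not modeled. All function names (`int_grid`, `pot_grid`, `quantize`, `select_type`) are hypothetical.

```python
def int_grid(bits):
    """Signed symmetric integer levels, e.g. -7..7 for 4 bits."""
    m = 2 ** (bits - 1) - 1
    return [float(v) for v in range(-m, m + 1)]


def pot_grid(bits):
    """Signed power-of-two levels: 0, +-1, +-2, +-4, ...

    Assumption: this stands in for a float-like type with no mantissa bits.
    It is NOT the flint type, whose encoding the abstract does not specify.
    """
    n = 2 ** (bits - 1) - 1  # number of positive levels
    pos = [2.0 ** e for e in range(n)]
    return sorted([-p for p in pos] + [0.0] + pos)


def quantize(values, grid):
    """Scale so the largest magnitude maps to the top grid level,
    then round every value to its nearest level (returned in the original scale)."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / grid[-1]
    return [scale * min(grid, key=lambda g: abs(v / scale - g)) for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_type(values, bits=4):
    """Pick the candidate grid with the lowest quantization MSE for this tensor.

    Assumption: MSE is used here as the selection criterion for illustration.
    """
    candidates = {"int": int_grid(bits), "pot": pot_grid(bits)}
    errors = {
        name: mse(values, quantize(values, grid))
        for name, grid in candidates.items()
    }
    best = min(errors, key=errors.get)
    return best, errors


# A uniformly spread tensor and a tensor dominated by one outlier.
uniform = [i / 100 for i in range(-100, 101)]
outlier = [i * 0.08 for i in range(-50, 51)] + [64.0]

print(select_type(uniform)[0])
print(select_type(outlier)[0])
```

In this sketch, the evenly spread tensor is better served by the uniform integer grid, while the tensor with a single large outlier favors the power-of-two grid, which spends its levels near zero and still reaches the outlier. Both grids use the same fixed bit width, which is the property that distinguishes this style of selection from variable-length schemes.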