Model quantization enables the deployment of deep neural networks on resource-constrained devices. Vector quantization reduces model size by indexing model weights with full-precision embeddings, i.e., codewords, but the indexed weights must be restored to 32-bit during computation. Binary and other low-precision quantization methods can reduce model size by up to 32$\times$, albeit at the cost of a considerable accuracy drop. In this paper, we propose an efficient ternary quantization framework that produces smaller and more accurate compressed models. By integrating hyperspherical learning, pruning, and reinitialization, our proposed Hyperspherical Quantization (HQ) method reduces the cosine distance between the full-precision and ternary weights, thereby reducing the bias of the straight-through gradient estimator during ternary quantization. Compared with existing work at similar compression levels ($\sim$30$\times$, $\sim$40$\times$), our method significantly improves test accuracy and reduces model size.
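To make the quantities in the abstract concrete, the following is a minimal NumPy sketch of ternary quantization and of the cosine distance it aims to minimize. The `ternarize` helper, its threshold value, and the mean-magnitude scaling factor are illustrative assumptions for this sketch, not the paper's exact HQ procedure; normalizing the weights to the unit hypersphere stands in for hyperspherical learning.

```python
import numpy as np

def ternarize(w, threshold=0.02):
    """Map full-precision weights to {-alpha, 0, +alpha}.

    Weights with magnitude below `threshold` are pruned to zero; the
    scaling factor alpha (an assumed choice here) is the mean magnitude
    of the surviving weights.
    """
    mask = np.abs(w) > threshold
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask

def cosine_distance(a, b):
    """1 - cosine similarity between two flattened weight tensors."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
# Project onto the unit hypersphere (a stand-in for hyperspherical learning).
w_sph = w / np.linalg.norm(w)
t = ternarize(w_sph)
print("cosine distance:", cosine_distance(w_sph, t))
```

A smaller cosine distance between `w_sph` and its ternary counterpart `t` means the straight-through estimator's gradient, computed at the ternary point, deviates less from the gradient at the full-precision point.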