Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs). Recently, product quantization (PQ) has been successfully applied to these workloads, replacing MACs with memory lookups to pre-computed dot products. While this property makes PQ an attractive solution for model acceleration, little is understood about the associated trade-offs in terms of compute and memory footprint, and the impact on accuracy. Our empirical study investigates the impact of different PQ settings and training methods on layerwise reconstruction error and end-to-end model accuracy. When studying the efficiency of deploying PQ DNNs, we find that metrics such as FLOPs, number of parameters, and even CPU/GPU performance, can be misleading. To address this issue, and to more fairly assess PQ in terms of hardware efficiency, we design the first custom hardware accelerator to evaluate the speed and efficiency of running PQ models. We identify PQ configurations that are able to improve performance-per-area for ResNet20 by 40%-104%, even when compared to a highly optimized conventional DNN accelerator. Our hardware performance outperforms recent PQ solutions by 4x, with only a 0.6% accuracy degradation. This work demonstrates the practical and hardware-aware design of PQ models, paving the way for wider adoption of this emerging DNN approximation methodology.
翻译:暂无翻译