Significant recent progress has been made in ultra low-bit quantization, which promises substantial improvements in latency, memory footprint, and energy consumption on edge devices. Quantization methods such as Learned Step Size Quantization can achieve model accuracy comparable to full-precision floating-point baselines even with sub-byte quantization. However, deploying these ultra low-bit quantized models on mainstream CPUs is extremely challenging because commodity SIMD (Single Instruction, Multiple Data) hardware typically supports a minimum of 8-bit precision. To overcome this limitation, we propose DeepGEMM, a lookup-table-based approach for executing ultra low-precision convolutional neural networks on SIMD hardware. The proposed method precomputes all possible products of weights and activations, stores them in a lookup table, and accesses them efficiently at inference time to avoid costly multiply-accumulate operations. Our 2-bit implementation outperforms the corresponding 8-bit integer kernels in the QNNPACK framework by up to 1.74x on x86 platforms.
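To make the core idea concrete, below is a minimal scalar C sketch (not the paper's SIMD kernels): with 2-bit codes there are only 4 x 4 = 16 possible weight-activation products, so the entire product table can be precomputed once and a dot product reduces to table lookups and additions. The quantization grids and function names here are illustrative assumptions, not the actual DeepGEMM codebooks or implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the lookup-table idea: with 2-bit codes there are only
 * 4 weight values and 4 activation values, so all 16 products can be
 * precomputed and reused, eliminating multiplies at inference time.
 * The grids below are illustrative placeholders. */

static int16_t lut[4][4]; /* lut[w][a] = w_val * a_val, precomputed */

static void build_lut(const int8_t w_vals[4], const int8_t a_vals[4]) {
    for (int w = 0; w < 4; ++w)
        for (int a = 0; a < 4; ++a)
            lut[w][a] = (int16_t)(w_vals[w] * a_vals[a]);
}

/* Dot product over n elements given as 2-bit codes (each in 0..3). */
static int32_t lut_dot(const uint8_t *w_codes, const uint8_t *a_codes, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += lut[w_codes[i]][a_codes[i]]; /* lookup replaces multiply */
    return acc;
}

int main(void) {
    /* Illustrative symmetric 2-bit quantization grids (assumed). */
    const int8_t w_vals[4] = {-3, -1, 1, 3};
    const int8_t a_vals[4] = {-2, -1, 1, 2};
    build_lut(w_vals, a_vals);

    const uint8_t w_codes[8] = {0, 1, 2, 3, 3, 2, 1, 0};
    const uint8_t a_codes[8] = {1, 1, 2, 2, 0, 3, 0, 3};
    printf("dot = %d\n", lut_dot(w_codes, a_codes, 8));
    return 0;
}
```

In the scalar form above the lookup simply replaces the multiply; the performance benefit on real hardware comes from performing many such lookups in parallel with SIMD instructions, which is what allows sub-byte kernels to beat 8-bit integer baselines.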