Deep learning models typically use single-precision (FP32) floating point data types for representing activations and weights, but a slew of recent research has shown that computations with reduced-precision data types (FP16, 16-bit integers, 8-bit integers, or even 4- or 2-bit integers) are sufficient to achieve the same accuracy as FP32 while being much more efficient. Therefore, we designed fbgemm, a high-performance kernel library, from the ground up to perform high-performance quantized inference on current-generation CPUs. fbgemm achieves efficiency by fusing common quantization operations with a high-performance gemm implementation and by generating shape- and size-specific kernel code at runtime. The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
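To make the idea of fusing quantization operations with the gemm concrete, the following is a minimal conceptual sketch, not fbgemm's actual API or its runtime-generated kernels: an int8 x int8 matrix multiply accumulates into int32 and requantizes each accumulator back to int8 in the same loop nest, instead of materializing an intermediate int32 output. The function names and the scale/zero-point parameters are hypothetical and chosen purely for illustration.

```cpp
// Conceptual sketch of a quantized GEMM with a fused requantization epilogue.
// This is an illustrative reference loop, not fbgemm's API or its
// cache-blocked, JIT-generated kernels.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Requantize one int32 accumulator to int8 using a real-valued scale and a
// zero point (hypothetical parameters for illustration).
static inline int8_t requantize(int32_t acc, float scale, int32_t zero_point) {
  int32_t q = static_cast<int32_t>(std::lround(acc * scale)) + zero_point;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));
}

// C[m][n] = requantize(sum_k A[m][k] * B[k][n]); the quantization step is
// fused into the output stage while the accumulator is still live.
void quantized_gemm(const int8_t* A, const int8_t* B, int8_t* C,
                    int M, int N, int K, float scale, int32_t zero_point) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k) {
        acc += static_cast<int32_t>(A[m * K + k]) *
               static_cast<int32_t>(B[k * N + n]);
      }
      // Fused epilogue: no intermediate int32 matrix is written to memory.
      C[m * N + n] = requantize(acc, scale, zero_point);
    }
  }
}

int main() {
  const int M = 2, N = 2, K = 3;
  std::vector<int8_t> A = {1, 2, 3, 4, 5, 6};       // 2x3, row-major
  std::vector<int8_t> B = {1, 0, 0, 1, 1, 1};       // 3x2, row-major
  std::vector<int8_t> C(M * N);
  quantized_gemm(A.data(), B.data(), C.data(), M, N, K,
                 /*scale=*/0.5f, /*zero_point=*/0);
  for (int i = 0; i < M * N; ++i) std::printf("%d ", C[i]);
  std::printf("\n");
  return 0;
}
```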