Recent industrial inference engines, such as FasterTransformer and TurboTransformers, have verified that half-precision floating point (FP16) and 8-bit integer (INT8) quantization can greatly improve model inference speed. However, existing FP16 and INT8 quantization methods are too complicated, and improper usage can severely degrade performance. In this paper, we develop a toolkit that lets users easily quantize their models for inference, in which a Self-Adaptive Mixed-Precision (SAMP) method is proposed to automatically control the quantization rate through a mixed-precision architecture, balancing efficiency and performance. Experimental results show that our SAMP toolkit achieves higher speedup than PyTorch and FasterTransformer while meeting the required performance. In addition, SAMP is based on a modular design that decouples the tokenizer, embedding, encoder, and target layers, which allows users to handle various downstream tasks, and it can be seamlessly integrated into PyTorch.
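To make the mixed-precision idea concrete, the following is a minimal, hypothetical sketch of how a quantization rate could map to a per-layer precision plan. The function name, the layer-ordering heuristic, and the "fraction of encoder layers in INT8" interpretation of the quantization rate are all illustrative assumptions, not SAMP's actual implementation.

```python
def assign_precision(num_layers, quant_rate):
    """Assign a precision ("int8" or "fp16") to each encoder layer.

    quant_rate is interpreted here as the fraction of layers quantized
    to INT8; the remaining layers stay in FP16. This is a hypothetical
    sketch of a mixed-precision plan, not SAMP's actual policy.
    """
    if not 0.0 <= quant_rate <= 1.0:
        raise ValueError("quant_rate must be in [0, 1]")
    num_int8 = round(num_layers * quant_rate)
    # Keep the earlier layers in FP16 and quantize the later ones:
    # early layers are often more sensitive to quantization error
    # (an assumption for illustration, not a claim from the paper).
    return ["fp16"] * (num_layers - num_int8) + ["int8"] * num_int8


# Example: a 12-layer encoder at a 0.5 quantization rate runs
# 6 layers in FP16 and 6 layers in INT8.
plan = assign_precision(12, 0.5)
```

Sweeping `quant_rate` from 0 to 1 then trades accuracy (all-FP16) against latency (all-INT8), which is the efficiency-performance balance the abstract describes.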