Transformer-based language models have become the standard approach to solving natural language processing tasks. However, industry adoption usually requires maximum throughput while complying with strict latency constraints, which often prevents Transformer models from being used in production. To address this gap, model compression techniques such as quantization and pruning can be used to improve inference efficiency. However, these compression techniques require specialized software to apply and deploy at scale. In this work, we propose a new pipeline for creating and running Fast Transformer models on CPUs, utilizing hardware-aware pruning, knowledge distillation, quantization, and our own Transformer inference runtime engine with optimized kernels for sparse and quantized operators. We demonstrate the efficiency of our pipeline by creating a Fast DistilBERT model that shows minimal accuracy loss on the SQuADv1.1 question-answering benchmark while delivering high throughput under typical production constraints and environments. Our results outperform the state-of-the-art Neural Magic DeepSparse runtime by up to 50% and achieve up to a 4.1x speedup over ONNX Runtime. Source code is publicly available at https://github.com/intel/intel-extension-for-transformers.
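To give a concrete flavor of the compression steps the pipeline combines, the minimal Python sketch below applies unstructured magnitude pruning and dynamic INT8 quantization to a distilled SQuAD model using stock PyTorch and Hugging Face APIs. This is only a conceptual stand-in, not the paper's toolchain: the proposed pipeline instead relies on hardware-aware pruning, knowledge distillation, and a custom runtime with optimized sparse and quantized kernels, and the model name and 80% sparsity level used here are illustrative assumptions.

# Conceptual prune-then-quantize sketch (illustration only; the paper's
# pipeline uses hardware-aware pruning and a custom sparse/INT8 runtime).
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForQuestionAnswering

# Distilled student model fine-tuned on SQuAD (illustrative choice).
model = AutoModelForQuestionAnswering.from_pretrained(
    "distilbert-base-uncased-distilled-squad"
)

# 1) Sparsify Linear weights with simple magnitude pruning (the 80% sparsity
#    level here is an illustrative setting, not the paper's configuration).
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

# 2) Quantize Linear weights to INT8 for faster CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

In the actual pipeline, the resulting sparse, quantized model would then be executed by the proposed Transformer inference runtime engine rather than by the default PyTorch backend.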