Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce model size and improve overall numerical performance by exploiting native hardware support for low-precision arithmetic. This paper conducts an elaborate performance characterization of the benefits of quantization -- mainly FP16 and INT8 variants with static and dynamic schemes -- using the MLPerf Edge Inference benchmarking methodology. The study is conducted on Intel x86 processors and a Raspberry Pi device with an ARM processor. We use a number of DNN inference frameworks, including OpenVINO (for Intel CPUs only), TensorFlow Lite (TFLite), ONNX, and PyTorch, with the MobileNetV2, VGG-19, and DenseNet-121 models. The single-stream, multi-stream, and offline scenarios of the MLPerf Edge Inference benchmarks are used to measure latency and throughput in our experiments. Our evaluation reveals that OpenVINO and TFLite are the most optimized frameworks for Intel CPUs and the Raspberry Pi device, respectively. We observe no loss in accuracy except with the static quantization techniques. We also observe clear benefits from quantization on these optimized frameworks: for example, INT8-quantized models deliver $3.3\times$ and $4\times$ better performance than FP32 using OpenVINO on the Intel CPU and TFLite on the Raspberry Pi device, respectively, for the MLPerf offline scenario. To the best of our knowledge, this is the first study to characterize the impact of quantization across a range of DNN inference frameworks -- including OpenVINO, TFLite, PyTorch, and ONNX -- on Intel x86 processors and a Raspberry Pi device with an ARM processor using the MLPerf Edge Inference benchmark methodology.
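To make the dynamic-quantization scheme mentioned above concrete, the following is a minimal sketch using PyTorch's standard `quantize_dynamic` API. It uses a small toy model rather than any of the networks studied in the paper; with dynamic quantization, weights are converted to INT8 ahead of time while activations are quantized on the fly at inference, which is why no calibration dataset is needed.

```python
import torch
import torch.nn as nn

# Toy model for illustration only (not MobileNetV2/VGG-19/DenseNet-121).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

# Dynamic INT8 quantization: nn.Linear weights are stored as INT8,
# activations are quantized per-batch at inference time.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 16)
with torch.no_grad():
    y = qmodel(x)
```

Static quantization, by contrast, additionally requires a calibration pass over representative inputs to fix activation scales in advance, which is the step that can introduce the accuracy loss the paper reports.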