Deep learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models, however, comes at the expense of high computational, storage, and power requirements. Sensing the immediate need to accelerate and compress these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low-bit quantized models on Arm-based platforms. We implement low-level quantization kernels for the Armv7 and Armv8 architectures, enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x compared to TensorFlow Lite with the XNNPACK backend on classification and detection models, respectively. We further achieve speedups of up to 5x and 3.2x compared to ONNX Runtime on classification and detection models, respectively.
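The abstract does not spell out how the ultra-low-bit kernels work, so the following is only a minimal, portable-C sketch of the bit-serial idea that 1-bit quantized inference commonly reduces to: values in {-1, +1} are packed 64 per machine word, and a dot product becomes XOR plus popcount. The function name binary_dot and the bit encoding are illustrative assumptions, not the Deeplite Runtime API; the actual Armv7/Armv8 kernels described in the paper would use NEON vectorization, tiling, and multithreading rather than this scalar loop.

#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch (not the Deeplite Runtime kernels): dot product of two
 * binary vectors whose elements are in {-1, +1}, packed as bits with
 * bit = 1 encoding +1 and bit = 0 encoding -1. Matching bits contribute +1,
 * differing bits contribute -1, so dot = n_bits - 2 * popcount(a XOR b). */
static int64_t binary_dot(const uint64_t *a, const uint64_t *b, int n_bits)
{
    int64_t diff = 0;              /* number of differing bit positions */
    int n_words = n_bits / 64;     /* assume n_bits is a multiple of 64 */
    for (int i = 0; i < n_words; ++i)
        diff += __builtin_popcountll(a[i] ^ b[i]);   /* GCC/Clang builtin */
    return (int64_t)n_bits - 2 * diff;
}

int main(void)
{
    /* 128-element vectors: a is all +1; b is +1 in its low word, -1 in its high word. */
    uint64_t a[2] = { ~0ULL, ~0ULL };
    uint64_t b[2] = { ~0ULL, 0ULL };
    /* Expected: 64 * (+1) + 64 * (-1) = 0 */
    printf("dot = %lld\n", (long long)binary_dot(a, b, 128));
    return 0;
}

A vectorized Arm version of this loop would replace the scalar popcount with byte-wise population counts (e.g., the NEON vcnt instruction) followed by horizontal sums, which is one reason such kernels benefit from the vectorization and tiling mentioned above.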