Transformer-based deep learning models have increasingly demonstrated high accuracy on many natural language processing (NLP) tasks. In this paper, we propose a compression-compilation co-design framework that guarantees the identified models meet both the resource and real-time specifications of mobile devices. Our framework applies a compiler-aware neural architecture optimization method (CANAO), which can generate the optimal compressed model that balances accuracy and latency. We achieve up to a 7.8x speedup over TensorFlow-Lite with only minor accuracy loss. We present two types of BERT applications on mobile devices: Question Answering (QA) and Text Generation. Both can be executed in real time with latency as low as 45ms. A video demonstrating the framework can be found at https://www.youtube.com/watch?v=_WIRvK_2PZI