Despite achieving state-of-the-art performance on many NLP tasks, Transformer-based pretrained language models (PLMs) suffer from high energy costs and long inference latency, which prevent their broader adoption, including on edge and mobile devices. Efficient NLP research aims to comprehensively consider computation, time, and carbon emissions across the entire NLP life-cycle, including data preparation, model training, and inference. In this survey, we focus on the inference stage and review the current state of model compression and acceleration for pretrained language models, including benchmarks, metrics, and methodology.