Balancing latency, power consumption, and accuracy in natural language understanding models is a central objective of efficient architecture development. This paper proposes an efficient Transformer architecture that adaptively adjusts its inference computational cost according to a desired inference latency speedup. The proposed encoder model operates with fewer Floating Point Operations (FLOPs) than the original Transformer architecture. During the fine-tuning phase, the method identifies the more important hidden sequence elements (word-vectors) in each encoder layer using a proposed Attention Context Contribution (ACC) metric and eliminates the less important word-vectors according to a new strategy. A mathematical inference speedup analysis is introduced to estimate the speedup accurately and to adjust the latency and computational cost of both the fine-tuning and inference phases. After fine-tuning, thanks to the method's offline-tuning property, the inference latency of the model can be adjusted over a wide range of speedup selections. The method is applied to the BERTbase model for evaluation. Extensive experiments show that most word-vectors in the higher BERT encoder layers contribute little to subsequent layers and can therefore be eliminated to improve inference latency. Experimental results on sentiment analysis, classification, and regression benchmarks, including GLUE, show that the method is effective across various datasets. The proposed method improves the inference latency of BERTbase by up to 4.8 times with less than a 0.75% average accuracy drop.
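The abstract does not spell out the ACC formula or the elimination strategy; as a rough illustration of the layer-wise word-vector pruning idea only, the sketch below scores each word-vector by the average attention it receives from the other positions and keeps the top-scoring fraction. The function names, the scoring proxy, and the `keep_ratio` parameter are assumptions for illustration, not the paper's definitions.

```python
import torch


def attention_context_contribution(attn_probs: torch.Tensor) -> torch.Tensor:
    """Rough per-token contribution proxy from a layer's attention probabilities.

    attn_probs: (batch, num_heads, seq_len, seq_len) softmax attention weights.
    Returns:    (batch, seq_len) score per word-vector; higher means the token
                supplies more context to the other positions.
    """
    # Average attention mass each key position receives, over heads and queries.
    return attn_probs.mean(dim=1).mean(dim=1)


def prune_word_vectors(hidden_states: torch.Tensor,
                       attn_probs: torch.Tensor,
                       keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the top `keep_ratio` fraction of word-vectors in a layer,
    so the subsequent encoder layers process a shorter sequence."""
    batch, seq_len, hidden_dim = hidden_states.shape
    scores = attention_context_contribution(attn_probs)          # (batch, seq_len)
    num_keep = max(1, int(seq_len * keep_ratio))
    # Select the highest-scoring positions, then restore their original order.
    top_idx = scores.topk(num_keep, dim=-1).indices.sort(dim=-1).values
    idx = top_idx.unsqueeze(-1).expand(-1, -1, hidden_dim)
    return hidden_states.gather(dim=1, index=idx)


if __name__ == "__main__":
    hidden = torch.randn(2, 16, 768)                  # (batch, seq_len, hidden)
    attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)
    pruned = prune_word_vectors(hidden, attn, keep_ratio=0.5)
    print(pruned.shape)                               # torch.Size([2, 8, 768])
```

In this toy setting, halving the sequence length in a layer roughly halves the FLOPs of the feed-forward sublayers that follow it, which is the intuition behind the reported latency speedups; the actual speedup-to-accuracy trade-off is governed by the paper's mathematical speedup analysis, not by this sketch.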