Transformers have greatly advanced the state-of-the-art in Natural Language Processing (NLP) in recent years, but impose very large computation and storage requirements. We observe that the design process of Transformers (pre-train a foundation model on a large dataset in a self-supervised manner, then fine-tune it for different downstream tasks) leads to task-specific models that are highly over-parameterized, adversely impacting both accuracy and inference efficiency. We propose AxFormer, a systematic framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task. AxFormer combines two key optimizations: accuracy-driven pruning and selective hard attention. Accuracy-driven pruning identifies and removes parts of the fine-tuned transformer that hinder performance on the given downstream task. Selective hard attention optimizes the attention blocks in selected layers by eliminating irrelevant word aggregations, thereby helping the model focus only on the relevant parts of the input. In effect, AxFormer produces models that are more accurate, while also being faster and smaller. Our experiments on GLUE and SQuAD tasks show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models. In addition, we demonstrate that AxFormer can be combined with prior techniques such as distillation or quantization to achieve further efficiency gains.
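To make the notion of selective hard attention concrete, the sketch below shows one common way to realize hard attention: keeping only the top-k attention scores per query token and masking out the rest before the softmax. This is an illustrative assumption, not the authors' exact implementation; the value of top_k, the use of PyTorch, and the choice of which layers to modify are all hypothetical here.

```python
# Minimal sketch of top-k hard attention (illustrative; not the AxFormer implementation).
import torch
import torch.nn.functional as F

def hard_topk_attention(q, k, v, top_k=8):
    """Scaled dot-product attention where each query attends only to its top_k keys.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    top_k:   number of keys each query may aggregate (assumed value for illustration)
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, L, L)

    # Mask out all but the top_k highest-scoring keys per query, so that
    # irrelevant word aggregations are eliminated before the softmax.
    top_k = min(top_k, scores.size(-1))
    kth_score = scores.topk(top_k, dim=-1).values[..., -1:]   # k-th largest score per query
    scores = scores.masked_fill(scores < kth_score, float("-inf"))

    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v)

# Toy usage: batch of 2, 4 heads, 16 tokens, 64-dim heads.
q = torch.randn(2, 4, 16, 64)
k = torch.randn(2, 4, 16, 64)
v = torch.randn(2, 4, 16, 64)
out = hard_topk_attention(q, k, v, top_k=4)
print(out.shape)  # torch.Size([2, 4, 16, 64])
```

In this sketch, replacing the dense softmax with a top-k-masked softmax in selected layers is what yields the "hard" aggregation; a full system would additionally decide, per layer and head, whether to apply the mask at all.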