Fine-tuning pre-trained language models (PLMs) achieves impressive performance on a range of downstream tasks, and model sizes have consequently grown. Since a different copy of the model is required for each task, this paradigm is infeasible for storage-constrained edge devices like mobile phones. In this paper, we propose SPARTAN, a parameter-efficient (PE) and computationally fast architecture for edge devices that adds a hierarchically organized sparse memory after each Transformer layer. SPARTAN freezes the PLM parameters and fine-tunes only its memory, thus significantly reducing storage costs by re-using the PLM backbone across different tasks. SPARTAN contains two levels of memory: for each input, only a sparse subset of parents is chosen at the first level, and the children cells corresponding to those parents are used to compute an output representation. This sparsity, combined with other architecture optimizations, improves SPARTAN's throughput by over 90% during inference on a Raspberry Pi 4 compared to PE baselines (adapters), while also outperforming the latter by 0.1 points on the GLUE benchmark. Further, SPARTAN can be trained 34% faster in a few-shot setting, while performing within 0.9 points of adapters. Qualitative analysis shows that different parent cells in SPARTAN specialize in different topics, thus dividing responsibility efficiently.
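The two-level lookup described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes dot-product similarity for parent selection and softmax-weighted averaging over the selected parents' children values, and all function and variable names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_level_sparse_memory(x, parent_keys, child_keys, child_values, k=2):
    """Hypothetical sketch of a two-level sparse memory lookup.

    parent_keys: level-1 key vectors, one per parent cell.
    child_keys / child_values: child_keys[p] and child_values[p] hold the
    level-2 children of parent p. Only children of the top-k parents
    (by similarity to the input x) contribute to the output.
    """
    # Level 1: score every parent, keep only the top-k (sparse selection).
    scores = [dot(x, pk) for pk in parent_keys]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

    # Level 2: attend only over the children of the selected parents.
    sel_keys = [ck for p in top for ck in child_keys[p]]
    sel_vals = [cv for p in top for cv in child_values[p]]
    weights = softmax([dot(x, ck) for ck in sel_keys])

    # Output is the weighted average of the selected children values.
    dim = len(x)
    return [sum(w * v[d] for w, v in zip(weights, sel_vals)) for d in range(dim)]
```

Because only the top-k parents' children are scored, the cost of the level-2 step scales with k rather than with the total number of memory cells, which is the source of the inference speedup on edge hardware.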