Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches such as pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive with DistilBERT on question similarity and recognizing textual entailment, while using only half the number of parameters and being three times faster at inference. We match or exceed the scores of ELMo on all tasks of the GLUE benchmark except the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. Moreover, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLMs into competitive models, and it motivates further research in this direction.
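The core CMOW idea mentioned above can be sketched in a few lines: each word is embedded as a small square matrix, and a sequence is encoded as the ordered product of its word matrices. The sketch below is illustrative only; the vocabulary, matrix size, and near-identity initialization are hypothetical choices, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # side length of each word-embedding matrix (hypothetical)
vocab = {"the": 0, "cat": 1, "sat": 2}

# Initialize embeddings near the identity so products of many matrices
# stay numerically stable (an assumption for this sketch).
embeddings = np.stack(
    [np.eye(d) + 0.01 * rng.standard_normal((d, d)) for _ in vocab]
)

def encode(tokens):
    """Encode a token sequence as the ordered product of word matrices."""
    out = np.eye(d)
    for tok in tokens:
        out = out @ embeddings[vocab[tok]]
    # The flattened product matrix serves as the sequence embedding.
    return out.flatten()

fwd = encode(["the", "cat", "sat"])
bwd = encode(["sat", "cat", "the"])
# Matrix multiplication is not commutative, so reading the sequence in the
# reverse direction yields a different encoding -- which is why a
# bidirectional component adds expressive power.
```

Unlike CBOW, which sums word vectors and therefore ignores word order, the matrix product above is order-sensitive, which is the property the hybrid variant combines with CBOW's additive embeddings.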