Large pre-trained multilingual models such as mBERT and XLM-R achieve state-of-the-art results on language understanding tasks. However, they are not well suited for latency-critical applications on either servers or edge devices, so it is important to reduce the memory and compute resources these models require. To this end, we propose pQRNN, a projection-based, embedding-free neural encoder that is tiny and effective for natural language processing tasks. Without pre-training, pQRNNs significantly outperform LSTM models with pre-trained embeddings despite being 140x smaller. With the same number of parameters, they outperform transformer baselines, showcasing their parameter efficiency. Additionally, we show that pQRNNs are effective student architectures for distilling large pre-trained language models. We perform careful ablations studying the effect of pQRNN parameters, data augmentation, and distillation settings. On MTOP, a challenging multilingual semantic parsing dataset, pQRNN students achieve 95.9\% of the performance of an mBERT teacher while being 350x smaller. On mATIS, a popular parsing task, pQRNN students reach 97.1\% of the teacher's performance on average, again while being 350x smaller. Our strong results suggest that our approach is well suited for latency-sensitive applications while still being able to leverage large mBERT-like models.
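To make the "projection-based, embedding-free" idea concrete, the following is a minimal sketch of how a token can be hashed directly into a fixed-size ternary feature vector instead of being looked up in a vocabulary-sized embedding table. The function name, the SHA-256-based hashing scheme, and the bit-to-ternary mapping are illustrative assumptions, not the exact projection used by pQRNN.

```python
import hashlib
import numpy as np

def ternary_projection(token: str, num_features: int = 128) -> np.ndarray:
    """Map a token to a ternary feature vector in {-1, 0, +1} via hashing.

    Illustrates the embedding-free projection idea: no embedding matrix is
    stored, so model size does not grow with vocabulary size. The hashing
    scheme here is a stand-in, not the one used in the pQRNN paper.
    """
    # Derive a deterministic bit stream from the token text.
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))
    # Reuse the bit stream if more bits are needed than the digest provides.
    bits = np.resize(bits, 2 * num_features)
    # Consume two bits per feature: 00 -> 0, 01 -> +1, 10 -> -1, 11 -> 0.
    pairs = bits.reshape(num_features, 2)
    values = pairs[:, 1].astype(np.int8) - pairs[:, 0].astype(np.int8)
    return values.astype(np.float32)

# A sentence becomes a (seq_len, num_features) matrix, which a small
# recurrent encoder (e.g. a QRNN stack) would consume downstream.
sentence = "set an alarm for six am".split()
features = np.stack([ternary_projection(t) for t in sentence])
print(features.shape)  # (6, 128)
```

In the full model, these per-token projection features feed a small quasi-recurrent encoder trained either from scratch or as a student distilled from an mBERT teacher; the sketch above only covers the projection step.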