End-to-end spoken language understanding (SLU) systems benefit from pretraining on large corpora, followed by fine-tuning on application-specific data. The resulting models are too large for on-edge applications. For instance, BERT-based systems contain over 110M parameters. Observing that these models are overparameterized, we propose a lean transformer structure in which the dimension of the attention mechanism is automatically reduced using group sparsity. We further propose a variant in which the learned attention subspace is transferred to an attention bottleneck layer. In a low-resource setting and without pretraining, the resulting compact SLU model achieves accuracies competitive with pretrained large models.
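As an illustration of the idea, the following is a minimal sketch of a group-sparsity (group-lasso) penalty applied to the attention dimension of a single-head self-attention layer, so that unused dimensions can be driven to zero and later pruned into a narrower bottleneck. The class name `LeanSelfAttention`, the method `group_sparsity_penalty`, and the regularization weight are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class LeanSelfAttention(nn.Module):
    """Single-head self-attention whose inner (attention) dimension can shrink.

    A group-lasso penalty over the rows of the query/key/value projections
    (one group per attention dimension) encourages entire dimensions to go
    to zero, so they can be pruned into an attention bottleneck.
    """
    def __init__(self, d_model: int, d_att: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_att, bias=False)
        self.k = nn.Linear(d_model, d_att, bias=False)
        self.v = nn.Linear(d_model, d_att, bias=False)
        self.out = nn.Linear(d_att, d_model, bias=False)
        self.scale = d_att ** -0.5

    def forward(self, x):                          # x: (batch, time, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        att = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out(att @ v)

    def group_sparsity_penalty(self):
        # One group per attention dimension: L2 norm of all weights feeding
        # that dimension, summed over dimensions (group lasso).
        groups = torch.cat([self.q.weight, self.k.weight, self.v.weight], dim=1)
        return groups.norm(dim=1).sum()

# Training-step sketch: the penalty is added to the task loss so that
# superfluous attention dimensions collapse to zero and can be pruned.
layer = LeanSelfAttention(d_model=256, d_att=256)
x = torch.randn(8, 50, 256)
task_loss = layer(x).pow(2).mean()                 # placeholder for the SLU loss
loss = task_loss + 1e-3 * layer.group_sparsity_penalty()
loss.backward()
```

After training, attention dimensions whose group norm falls below a threshold can be removed from the projection matrices, yielding the reduced attention dimension referred to in the abstract.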