While transformers and their variant, conformers, show promising performance in speech recognition, their large number of parameters leads to high memory cost during training and inference. Some works use cross-layer weight sharing to reduce the number of model parameters. However, the inevitable loss of capacity harms model performance. To address this issue, this paper proposes a parameter-efficient conformer that shares sparsely-gated experts. Specifically, we use a sparsely-gated mixture-of-experts (MoE) layer to extend the capacity of a conformer block without increasing computation. Then, the parameters of grouped conformer blocks are shared so that the total number of parameters is reduced. Next, to give the shared blocks the flexibility to adapt representations at different levels, we design the MoE routers and normalization layers individually for each block. Moreover, we use knowledge distillation to further improve performance. Experimental results show that the proposed model achieves performance competitive with the full-parameter model while using only 1/3 of the encoder parameters.
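To make the sharing scheme concrete, the following is a minimal sketch (not the authors' implementation) of cross-layer weight sharing combined with sparsely-gated MoE: experts are shared within each group of blocks, while every block keeps its own router and normalization. The simple feed-forward experts, top-1 routing, and the group/expert counts are illustrative assumptions standing in for full conformer blocks.

```python
# Sketch: grouped weight sharing of MoE experts with per-block routers and norms.
# Assumptions: FFN experts instead of full conformer blocks, top-1 routing,
# pre-norm residual connections, and illustrative group/expert counts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertGroup(nn.Module):
    """A set of FFN experts whose weights are shared by all blocks in one group."""

    def __init__(self, d_model: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        )


class SharedMoEEncoder(nn.Module):
    """Blocks in the same group reuse one ExpertGroup (parameter sharing),
    but each block has its own router and LayerNorm so the shared weights
    can adapt to representations at different depths."""

    def __init__(self, d_model: int, num_groups: int = 2,
                 blocks_per_group: int = 3, num_experts: int = 4):
        super().__init__()
        self.blocks_per_group = blocks_per_group
        n_blocks = num_groups * blocks_per_group
        self.groups = nn.ModuleList(
            ExpertGroup(d_model, num_experts) for _ in range(num_groups))
        # Unshared, per-block components.
        self.routers = nn.ModuleList(
            nn.Linear(d_model, num_experts) for _ in range(n_blocks))
        self.norms = nn.ModuleList(
            nn.LayerNorm(d_model) for _ in range(n_blocks))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        for b in range(len(self.norms)):
            experts = self.groups[b // self.blocks_per_group].experts  # shared
            h = self.norms[b](x)                                       # per-block norm
            gates = F.softmax(self.routers[b](h), dim=-1)              # per-block router
            top_gate, top_idx = gates.max(dim=-1)                      # top-1 routing
            out = torch.zeros_like(h)
            for e, expert in enumerate(experts):
                mask = top_idx == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] = top_gate[mask].unsqueeze(-1) * expert(h[mask])
            x = x + out                              # residual connection
        return x
```

With sharing, the expert weights are stored once per group rather than once per block, so the encoder parameter count drops roughly by the factor `blocks_per_group`, while the per-block routers and norms add only a small overhead.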