End-to-end (E2E) speech recognition has recently become popular because it integrates the acoustic, pronunciation, and language models into a single neural network and outperforms conventional models. Among E2E approaches, attention-based models such as the Transformer have emerged as superior. Such models have opened the door to deploying automatic speech recognition (ASR) on smart devices, but they still require a large number of model parameters. We propose an extremely low-footprint E2E ASR system for smart devices that satisfies resource constraints without sacrificing recognition accuracy. We design cross-layer weight sharing to improve parameter efficiency, and we further exploit model compression methods, including sparsification and quantization, to reduce memory storage and boost decoding efficiency. We evaluate our approaches on the public AISHELL-1 and AISHELL-2 benchmarks. On the AISHELL-2 task, the proposed method achieves more than 10x compression (model size falls from 248 MB to 24 MB) at the cost of only a minor performance loss (character error rate, CER, rises from 6.49% to 6.92%).
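To make the three ingredients named in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes one common form of each technique: cross-layer sharing modeled as several layers reusing a small number of stored weight sets, sparsification as global magnitude pruning, and quantization as symmetric per-tensor int8 rounding. All function names, the sparsity level, and the layer sizes are hypothetical and chosen only for illustration; the abstract does not specify which variants the paper uses.

```python
import random


def shared_param_count(params_per_layer, num_layers, num_shared_groups):
    """Parameters stored when `num_layers` layers reuse `num_shared_groups`
    distinct weight sets (assumed form of cross-layer weight sharing)."""
    if not 1 <= num_shared_groups <= num_layers:
        raise ValueError("num_shared_groups must be in [1, num_layers]")
    return params_per_layer * num_shared_groups


def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    # Indices sorted by absolute value; the first k are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q,
    with integer q clamped to [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to floating-point values."""
    return [scale * v for v in q]


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical toy layer: 1000 Gaussian weights.
    w = [random.gauss(0.0, 1.0) for _ in range(1000)]

    pruned = magnitude_prune(w, sparsity=0.5)
    q, scale = quantize_int8(pruned)
    restored = dequantize(q, scale)

    zeros = sum(1 for v in pruned if v == 0.0)
    max_err = max(abs(a - b) for a, b in zip(pruned, restored))
    print("zeroed weights:", zeros)
    print("max quantization error:", max_err, "bound:", scale / 2)

    # Hypothetical 12-layer stack in which all layers share one weight set.
    print("stored params:", shared_param_count(1000, 12, 1), "vs", 1000 * 12)
```

Under these assumptions, sharing shrinks the stored parameter count by the ratio of layers to shared groups, pruning zeroes out weights that a sparse storage format could omit, and int8 storage uses one byte per weight instead of the four bytes of float32. The sketch does not reproduce the reported numbers; for reference, 248 MB to 24 MB is a ratio of about 10.3, and 6.49% to 6.92% CER is a relative increase of about 6.6%.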