Transformer-based end-to-end speech recognition has achieved great success. However, the large footprint and computational overhead make it difficult to deploy these models in some real-world applications. Model compression techniques can reduce the model size and speed up inference, but the compressed model has a fixed architecture which might be suboptimal. We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs. With a similar number of layers at inference time, I3D-based models outperform the vanilla Transformer and a statically pruned model obtained via iterative layer pruning. We also present an analysis of the gate probabilities and the input dependency, which helps us better understand deep encoders.
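To make the idea of input-dependent dynamic depth concrete, below is a minimal sketch of one way a per-layer gate could be predicted from the layer input, so that the executed depth varies per utterance. This is an illustrative assumption, not the paper's exact formulation: the class name `GatedEncoderLayer`, the mean-pooled sigmoid gate, and the `gate_threshold` parameter are all hypothetical.

```python
# Hypothetical sketch of an input-dependent layer gate (not the I3D paper's
# exact method): a small predictor pools the layer input over time and emits
# a probability of executing the layer.
import torch
import torch.nn as nn


class GatedEncoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=4, gate_threshold=0.5):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=4 * d_model, batch_first=True
        )
        # Gate predictor: pooled input -> probability of running this layer.
        self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.gate_threshold = gate_threshold

    def forward(self, x):                      # x: (batch, time, d_model)
        p = self.gate(x.mean(dim=1)).unsqueeze(1)   # (batch, 1, 1), input-dependent
        if self.training:
            # Soft gating keeps the skip decision differentiable during training.
            return p * self.layer(x) + (1.0 - p) * x
        # Hard decision at inference. The layer output is masked here for
        # simplicity; with batch size 1 the layer call itself can be skipped
        # entirely, which is where the compute savings come from.
        keep = (p > self.gate_threshold).float()
        return keep * self.layer(x) + (1.0 - keep) * x


# Usage: stack gated layers; the executed depth now varies per input.
encoder = nn.Sequential(*[GatedEncoderLayer() for _ in range(12)])
feats = torch.randn(2, 100, 256)               # (batch, frames, features)
out = encoder(feats)
```

The key design point this sketch illustrates is the soft/hard split: a differentiable gate during training and a thresholded binary decision at inference, which is what allows the model to skip layers for some inputs while remaining trainable end to end.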