Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate that state-of-the-art Conformer models generally contain multiple ambient layers. We study the stability of these layers across runs and model sizes, propose that group normalization can be used without disrupting their formation, and examine their correlation with the model weight updates in each layer. Finally, we apply these findings to Federated Learning to improve the training procedure by targeting Federated Dropout at layers according to their importance. This allows us to reduce the size of the model optimized by each client without quality degradation, and shows potential for future exploration.
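To make the layer-targeting idea concrete, the following is a minimal, hypothetical sketch in Python (NumPy) of how per-layer importance scores could be mapped to Federated Dropout rates and per-round masks, so that less important (ambient) layers are dropped more aggressively; the function names, rate range, and scoring procedure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dropout_rates(importance, min_rate=0.0, max_rate=0.75):
    """Map per-layer importance scores to dropout rates: less important layers
    (e.g. ambient layers) receive higher Federated Dropout rates."""
    imp = np.asarray(importance, dtype=np.float64)
    scaled = (imp - imp.min()) / (imp.max() - imp.min() + 1e-8)
    return max_rate - scaled * (max_rate - min_rate)

def sample_layer_mask(weight_shape, rate, rng):
    """Sample a binary keep-mask over one layer's weights for a client round."""
    return (rng.random(weight_shape) >= rate).astype(np.float32)

# Hypothetical usage: four layers with importance scores estimated elsewhere
# (for instance, from per-layer weight-update statistics).
rng = np.random.default_rng(0)
importance = [0.9, 0.2, 0.15, 0.8]
rates = dropout_rates(importance)
masks = [sample_layer_mask((4, 4), r, rng) for r in rates]
for i, (r, m) in enumerate(zip(rates, masks)):
    print(f"layer {i}: rate={r:.2f}, kept={int(m.sum())}/16 weights")
```

Under this kind of scheme, the parameters masked out for a given round need not be sent to or optimized by the client, which is how the size of the per-client model can shrink without touching the layers that matter most.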