Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing

Self-supervised learning (SSL) of rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks. Because it reduces the need for large amounts of transcribed speech, it has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which is at odds with limited on-device resources. This gap can be more severe in multilingual/multitask scenarios that require simultaneously recognizing multiple languages or executing multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to overfit when finetuned on low-resource speech corpora. This work aims to make speech SSL models more practical by achieving both higher efficiency and less overfitting through our proposed S$^3$-Router framework. For the first time, we find that simply discarding no more than 10\% of model weights, by finetuning only the model connections of speech SSL models, can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique that enables (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router provides a new perspective on the practical deployment of speech SSL models. Our code is available at: https://github.com/GATECH-EIC/S3-Router.
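The abstract does not spell out how "finetuning model connections" is carried out, so the following is only a minimal sketch under stated assumptions: each frozen weight gets a learnable score, the lowest-scoring 10% of connections are masked out, and the scores are updated with a straight-through gradient while the weights themselves never change. The toy linear layer, the squared-error loss, and the names `topk_mask`, `forward`, and `finetune_connections` are hypothetical illustrations, not the S$^3$-Router implementation.

```python
import random


def topk_mask(scores, sparsity):
    """Binary mask that zeroes the `sparsity` fraction of lowest-scoring entries."""
    rows, cols = len(scores), len(scores[0])
    order = sorted(
        ((i, j) for i in range(rows) for j in range(cols)),
        key=lambda ij: scores[ij[0]][ij[1]],
    )
    n_drop = int(rows * cols * sparsity)  # floor, so never more than `sparsity`
    mask = [[1.0] * cols for _ in range(rows)]
    for i, j in order[:n_drop]:
        mask[i][j] = 0.0
    return mask


def forward(weights, mask, x):
    """Linear layer y = (W * mask) x with element-wise masking of frozen weights."""
    return [
        sum(w * m * xj for w, m, xj in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


def finetune_connections(weights, scores, data, sparsity=0.1, lr=0.1, epochs=20):
    """Update only the scores (assumed mechanism); `weights` stay frozen."""
    for _ in range(epochs):
        for x, target in data:
            mask = topk_mask(scores, sparsity)
            y = forward(weights, mask, x)
            for i, (yi, ti) in enumerate(zip(y, target)):
                err = yi - ti  # gradient of 0.5 * (y_i - t_i)^2 w.r.t. y_i
                for j, xj in enumerate(x):
                    # Straight-through estimator: treat d(mask)/d(score) as 1.
                    scores[i][j] -= lr * err * weights[i][j] * xj
    return topk_mask(scores, sparsity)


if __name__ == "__main__":
    random.seed(0)
    weights = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(4)]
    scores = [[random.uniform(0, 1) for _ in range(5)] for _ in range(4)]
    data = [([random.uniform(-1, 1) for _ in range(5)], [0.0] * 4)]
    mask = finetune_connections(weights, scores, data)
    dropped = sum(m == 0.0 for row in mask for m in row)
    print(f"dropped {dropped} of 20 connections")
```

Under this assumed scheme, each language or task would only need to store its own binary mask on top of one shared set of frozen weights, which is one plausible reading of the multilingual/multitask efficiency claim.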
