Automatic speech recognition research focuses on training and evaluating on static datasets. Yet, as speech models are increasingly deployed on personal devices, such models encounter user-specific distributional shifts. To simulate this real-world scenario, we introduce LibriContinual, a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks, with data from 118 individual speakers and 6 train splits of different sizes per speaker. Additionally, current speech recognition models and continual learning algorithms are not optimized for compute efficiency. We adapt NetAug, a general-purpose training algorithm, to ASR and create a novel Conformer variant called the DisConformer (Disentangled Conformer). This algorithm produces ASR models consisting of a frozen 'core' network for general-purpose use and several tunable 'augment' networks for speaker-specific tuning. Using such models, we propose a novel compute-efficient continual learning algorithm called DisentangledCL. Our experiments show that the DisConformer models significantly outperform baselines on general ASR, i.e., LibriSpeech (15.58% rel. WER on test-other). On speaker-specific LibriContinual, they significantly outperform trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even match fully finetuned baselines in some settings.
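The core idea behind the disentangled design — a frozen, shared 'core' path plus a small trainable 'augment' path that absorbs speaker-specific adaptation — can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: the layer, its scalar weights, and the `adapt` step are invented for clarity, and a real DisConformer would apply this split inside Conformer blocks with full weight matrices.

```python
# Hypothetical sketch of the frozen-core / tunable-augment pattern.
# All names (DisentangledLayer, adapt) are illustrative, not from the paper.

class DisentangledLayer:
    """A layer whose output sums a frozen 'core' path and a tunable 'augment' path."""

    def __init__(self, core_weight, augment_weight):
        self.core_weight = core_weight        # frozen: shared, general-purpose
        self.augment_weight = augment_weight  # trainable: speaker-specific

    def forward(self, x):
        # Core and augment contributions are disentangled and simply summed.
        return self.core_weight * x + self.augment_weight * x

    def adapt(self, x, target, lr=0.1):
        """One SGD step on a squared error, updating ONLY the augment path."""
        pred = self.forward(x)
        grad = 2.0 * (pred - target) * x      # d(MSE)/d(augment_weight)
        self.augment_weight -= lr * grad      # core_weight is never touched


# Speaker-specific adaptation: fit target y = 1.5 * x while the core stays fixed.
layer = DisentangledLayer(core_weight=1.0, augment_weight=0.0)
core_before = layer.core_weight
for _ in range(100):
    layer.adapt(x=2.0, target=3.0)

assert layer.core_weight == core_before       # core is frozen throughout
```

Because only the augment parameters receive gradients, per-speaker adaptation touches a small parameter budget, and the frozen core can be reused unchanged across all speakers — the property the trainable-parameter-matched comparisons in the abstract rely on.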