With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose data selection techniques that efficiently scale training data by finding the most valuable samples within massive datasets. To efficiently scale model size, we leverage various optimizations such as sparse transducer loss and model sharding. By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains. Furthermore, our models learn powerful speech representations with zero- and few-shot capabilities on novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively, while realizing the best performance on public social media videos. Finally, the same universal model reaches equivalent performance with 500x less in-domain data on the SPGISpeech financial-domain dataset.
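To make the data selection idea concrete, below is a minimal sketch, assuming a reference model assigns each utterance a difficulty score (for example, its loss) and we retain the highest-scoring samples. The scoring criterion, helper names, and utterance IDs are hypothetical illustrations, not the selection method used in this work.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical sketch: rank utterances by a reference model's loss and keep
# the top-k "most valuable" samples from a massive pool. The loss-based
# criterion is an assumption, not the paper's published selection technique.

def select_top_k(
    samples: Iterable[Tuple[str, float]],  # (utterance_id, reference_loss)
    k: int,
) -> List[str]:
    """Keep the k samples the reference model finds hardest (highest loss)."""
    top = heapq.nlargest(k, samples, key=lambda s: s[1])
    return [utt_id for utt_id, _ in top]

# Example: score a small pool and keep the 2 hardest utterances.
scored = [("utt_001", 2.3), ("utt_002", 0.4), ("utt_003", 5.1)]
print(select_top_k(scored, k=2))  # -> ['utt_003', 'utt_001']
```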
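To give a flavor of the model-sharding optimization, here is a minimal sketch using PyTorch's FullyShardedDataParallel as a stand-in; the paper's actual sharding implementation is not specified here, so treat this as an assumption-laden illustration rather than the authors' setup.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_sharded_model(model: torch.nn.Module) -> torch.nn.Module:
    """Wrap a large ASR model so its parameters and optimizer state are
    sharded across ranks instead of replicated, roughly dividing per-GPU
    memory by the world size."""
    # Assumes the process group was already set up, e.g. via torchrun and
    # dist.init_process_group(backend="nccl").
    assert dist.is_initialized(), "call dist.init_process_group(...) first"
    return FSDP(model)

# Usage sketch: each rank holds only a shard of the 1-10B parameters and
# gathers full weights layer by layer during forward and backward passes.
```

The design point is that sharding trades extra communication (all-gathers per layer) for memory headroom, which is what makes billion-parameter models trainable on a fixed GPU budget.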