Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted considerable interest in the research community. Benefiting from diverse data sources spanning different modalities, languages, and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M model trained from scratch on an additional 300M supervised in-domain data.
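To make the adapter-finetuning step concrete, the sketch below is a minimal, hypothetical illustration (not the paper's implementation): a stand-in pretrained encoder is frozen, small residual bottleneck adapters are inserted per layer, and only the adapters and the decoder are updated on in-domain data, which is how a small finetuned-parameter count (e.g., 130.8M out of a much larger model) can arise. `ResidualAdapter`, `AdaptedEncoder`, the bottleneck size, and the toy layers are all assumptions for illustration.

```python
# Illustrative sketch (assumptions, not the paper's code): parameter-efficient
# domain adaptation by freezing a pretrained FM encoder and training only
# small residual adapters plus the decoder.
import torch
import torch.nn as nn


class ResidualAdapter(nn.Module):
    """Bottleneck adapter: normalize, down-project, nonlinearity, up-project, residual add."""

    def __init__(self, d_model: int, bottleneck: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(self.norm(x))))


class AdaptedEncoder(nn.Module):
    """Wraps frozen pretrained encoder layers and applies a trainable adapter after each."""

    def __init__(self, pretrained_layers: nn.ModuleList, d_model: int):
        super().__init__()
        self.layers = pretrained_layers
        for p in self.layers.parameters():
            p.requires_grad = False  # keep FM encoder weights fixed
        self.adapters = nn.ModuleList(
            ResidualAdapter(d_model) for _ in pretrained_layers
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer, adapter in zip(self.layers, self.adapters):
            x = adapter(layer(x))
        return x


if __name__ == "__main__":
    d_model = 512
    # Stand-in for a BEST-RQ-pretrained speech encoder (assumption).
    pretrained_layers = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(4))
    encoder = AdaptedEncoder(pretrained_layers, d_model)
    decoder = nn.Linear(d_model, 1024)  # toy decoder / output projection

    # Only adapter and decoder parameters are updated during in-domain finetuning.
    trainable = [p for p in encoder.parameters() if p.requires_grad]
    trainable += list(decoder.parameters())
    optimizer = torch.optim.Adam(trainable, lr=1e-4)

    feats = torch.randn(2, 100, d_model)  # (batch, frames, features)
    logits = decoder(encoder(feats))
    print(logits.shape)  # torch.Size([2, 100, 1024])
```

In this setup the frozen encoder retains the knowledge acquired during self-supervised pretraining, while the small trainable surface keeps the supervised in-domain data requirement low, matching the data- and parameter-efficiency claim above.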