This paper describes our work on developing new acoustic models for automated speech recognition (ASR) at KBLab, the infrastructure for data-driven research at the National Library of Sweden (KB). We evaluate different approaches for a viable speech-to-text pipeline for audiovisual resources in Swedish, using the wav2vec 2.0 architecture in combination with speech corpora created from KB's collections. These approaches include pretraining an acoustic model for Swedish from the ground up, and fine-tuning existing monolingual and multilingual models. The collections-based corpora we use have been sampled from millions of hours of speech, with a conscious attempt to balance regional dialects to produce a more representative, and thus more democratic, model. The resulting acoustic model, "VoxRex", outperforms existing models for Swedish ASR. We also evaluate combining this model with various pretrained language models, which further improves performance. We conclude by highlighting the potential of such technology for cultural heritage institutions with vast collections of previously unlabelled audiovisual data. Our models are released for further exploration and research here: https://huggingface.co/KBLab.
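For readers who want to try the released models, the sketch below shows one way to load a wav2vec 2.0 CTC model from the Hugging Face hub and transcribe an audio clip. The model identifier "KBLab/wav2vec2-large-voxrex-swedish" is an assumption for illustration; check https://huggingface.co/KBLab for the actual identifiers of the released checkpoints.

```python
# Minimal transcription sketch using the Hugging Face transformers library.
# Assumption: the model ID below is illustrative; see https://huggingface.co/KBLab
# for the models actually released with the paper.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "KBLab/wav2vec2-large-voxrex-swedish"  # assumed identifier
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

def transcribe(speech):
    """Transcribe `speech`, a 1-D float array of 16 kHz mono audio
    (e.g. loaded with librosa or torchaudio)."""
    inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding; the paper's language-model rescoring is not shown here.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Note that this uses plain greedy decoding; the language-model combination evaluated in the paper would replace the `argmax` step with a beam-search decoder over the CTC logits.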