In this paper, we present our progress in pretraining Czech monolingual audio transformers on a large dataset containing more than 80 thousand hours of unlabeled speech, and in subsequently fine-tuning the model on automatic speech recognition tasks using a combination of in-domain data and almost 6 thousand hours of out-of-domain transcribed speech. We present a broad palette of experiments with various fine-tuning setups, evaluated on two public datasets (CommonVoice and VoxPopuli) and one extremely challenging dataset from the MALACH project. Our results show that monolingual Wav2Vec 2.0 models are robust ASR systems that can take advantage of large labeled and unlabeled datasets and successfully compete with state-of-the-art LVCSR systems. Moreover, Wav2Vec models proved to be good zero-shot learners when no training data are available for the target ASR task.