We propose meta audio concatenation (MAC), a unified framework for low-resource automatic speech recognition (ASR). MAC is easy to implement and can be applied in extremely low-resource settings. We give a clear mathematical description of the framework from the perspective of Bayesian sampling. Within the framework, we use a novel concatenative-synthesis text-to-speech (TTS) system to boost low-resource ASR; this system lets us integrate language pronunciation rules and adjust the TTS process. We further propose a broad notion of a meta audio set to meet the modeling needs of different languages and scenarios. Extensive experiments demonstrate the effectiveness of MAC on low-resource ASR tasks. On Cantonese, Taiwanese, and Japanese ASR tasks, MAC reduces the character error rate (CER) by more than 15\% under the CTC greedy search, CTC prefix, attention, and attention rescoring decoding modes. Furthermore, MAC outperforms wav2vec2 (with fine-tuning) on the Common Voice Cantonese dataset and achieves competitive results on the Common Voice Taiwanese and Japanese datasets. Notably, we achieve a \textbf{10.9\%} CER on the Common Voice Cantonese ASR task, a relative improvement of about \textbf{30\%} over wav2vec2 (with fine-tuning).
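For readers unfamiliar with the metric, the CER figures above can be read through the following minimal sketch. It assumes the standard definition of CER (character-level edit distance divided by the number of reference characters) and the usual definition of relative improvement; the function names `edit_distance`, `cer`, and `relative_reduction` are illustrative and are not taken from the paper's evaluation code.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters yields the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total character edits over total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def relative_reduction(baseline: float, new: float) -> float:
    """Relative error reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline
```

Under these assumed definitions, a system that lowers CER from 0.20 to 0.10 has a relative reduction of 0.5, that is, a 50\% relative improvement.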