Automatic Speech Recognition (ASR) has increasing utility in the modern world. Many ASR models are available for languages with large amounts of training data, such as English. However, low-resource languages are poorly represented. In response, we create and release an openly licensed, formatted dataset of audio recordings of the Bible in low-resource northern Indian languages. We set up multiple experimental splits, then train and analyze two competitive ASR models to serve as baselines for future research using this data.