End-to-end (E2E) models have become the default choice for state-of-the-art speech recognition systems. Such models are trained on large amounts of labelled data, which is often not available for low-resource languages. Techniques such as self-supervised learning and transfer learning hold promise, but have not yet been effective in training accurate models. On the other hand, collecting labelled datasets across a diverse set of domains and speakers is very expensive. In this work, we demonstrate an inexpensive and effective alternative to these approaches by "mining" text and audio pairs for Indian languages from public sources, specifically from the public archives of All India Radio. As a key component, we adapt the Needleman-Wunsch algorithm to align sentences with corresponding audio segments given a long audio recording and a PDF of its transcript, while being robust to errors due to OCR, extraneous text, and non-transcribed speech. We thus create Shrutilipi, a dataset which contains over 6,400 hours of labelled audio across 12 Indian languages, totalling 4.95M sentences. On average, Shrutilipi results in a 2.3x increase over publicly available labelled data. We establish the quality of Shrutilipi with 21 human evaluators across the 12 languages. We also establish the diversity of Shrutilipi in terms of represented regions, speakers, and mentioned named entities. Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark. For Hindi, which has the most benchmarks (7), the average WER falls from 18.8% to 13.5%. This improvement extends to efficient models: we show a 2.3% drop in WER for a Conformer model (10x smaller than Wav2Vec). Finally, we demonstrate the diversity of Shrutilipi by showing that the model trained with it is more robust to noisy input.
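The abstract builds on the classic Needleman-Wunsch dynamic-programming algorithm for global sequence alignment. The sketch below shows only the textbook version of that base algorithm on character sequences, not the paper's adaptation to sentence/audio-segment alignment; the scoring parameters (`match`, `mismatch`, `gap`) are illustrative defaults, not values from the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Classic Needleman-Wunsch global alignment of two sequences.

    Returns the two gap-padded aligned strings and the alignment score.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # align prefix of a against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # align prefix of b against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # match/mismatch
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Traceback from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-'); i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1]); j -= 1
    return ''.join(reversed(out_a)), ''.join(reversed(out_b)), score[n][m]
```

For example, `needleman_wunsch("abc", "ac")` aligns the sequences as `abc` / `a-c` with score 1 (two matches and one gap). The paper's setting would replace character equality with a robust similarity between OCR'd sentences and ASR hypotheses for audio segments, and tune the gap penalties to tolerate extraneous text and non-transcribed speech.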