This paper introduces a human-in-the-loop (HITL) data annotation pipeline for generating high-quality, large-scale speech datasets. The pipeline combines the strengths of humans and machines: machine pre-labeling followed by fully manual auditing makes annotation faster, more accurate, and more cost-effective. Quality control mechanisms such as blind testing, behavior monitoring, and data validation are built into the pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results show that the HITL pipeline improves annotation speed and capacity by at least 80%, with quality comparable to or higher than manual double-pass annotation. We are using this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Each year, Appen will donate a number of free speech datasets from the UHV-OTS corpora under the CC-BY-SA license to support academic and open-source community research. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper.
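The combination of machine pre-labeling, fully manual auditing, and blind testing described in the abstract can be illustrated with a minimal sketch. This is not the released Appen code: the names `Item`, `build_queue`, `audit`, and `blind_test_accuracy` are hypothetical, and the choice to give blind-test items a deliberately wrong pre-label is an assumption about how blind testing might expose annotators who accept machine labels without checking them.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Item:
    audio_id: str
    machine_label: str                 # pre-label from a hypothetical ASR model
    gold_label: Optional[str] = None   # set only on blind-test items
    human_label: Optional[str] = None  # filled in during the manual audit


def build_queue(items: List[Item], blind_items: List[Item], seed: int = 0) -> List[Item]:
    """Mix blind-test items into the work queue so they are not identifiable by position."""
    queue = list(items) + list(blind_items)
    random.Random(seed).shuffle(queue)
    return queue


def audit(queue: List[Item], annotator: Callable[[Item], str]) -> List[Item]:
    """Fully manual audit: the annotator reviews every pre-labeled item."""
    for item in queue:
        item.human_label = annotator(item)
    return queue


def blind_test_accuracy(queue: List[Item]) -> Optional[float]:
    """Fraction of blind-test items on which the human label matches the gold label."""
    blind = [item for item in queue if item.gold_label is not None]
    if not blind:
        return None
    correct = sum(item.human_label == item.gold_label for item in blind)
    return correct / len(blind)
```

In this sketch, an annotator who simply returns `item.machine_label` scores zero on blind-test items whose pre-labels were deliberately corrupted, which is one way a pipeline could detect bias introduced by machine-generated labels.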