Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development

This paper introduces a human-in-the-loop (HITL) data annotation pipeline for generating high-quality, large-scale speech datasets. The pipeline combines the strengths of humans and machines, pairing machine pre-labeling with fully manual auditing to annotate datasets more quickly, accurately, and cost-effectively. Quality control mechanisms such as blind testing, behavior monitoring, and data validation are built into the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrate that the HITL pipeline can improve annotation speed and capacity by at least 80%, with quality comparable to or higher than manual double-pass annotation. We are leveraging this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Appen will donate a number of free speech datasets from the UHV-OTS corpora each year under the CC-BY-SA license to support research in the academic and open source communities. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper.
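To make the interplay of pre-labeling, manual auditing, and blind testing concrete, here is a minimal Python sketch. It is not the authors' released code: the names `Item`, `audit`, and `blind_test_accuracy` are hypothetical, and the blind test is assumed to work by hiding items with known-correct transcripts among ordinary items and checking what the annotator submits for them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Item:
    audio_id: str
    pre_label: str                     # machine-generated transcript (pre-labeling)
    gold: Optional[str] = None         # known-correct transcript, set only on blind-test items
    final_label: Optional[str] = None  # transcript submitted by the human auditor


def audit(items: List[Item], annotator: Callable[[Item], str]) -> List[Item]:
    """Every item is manually audited; the annotator keeps or corrects the pre-label."""
    for item in items:
        item.final_label = annotator(item)
    return items


def blind_test_accuracy(items: List[Item]) -> Optional[float]:
    """Share of hidden gold items labeled correctly, or None if the batch has none."""
    gold_items = [i for i in items if i.gold is not None]
    if not gold_items:
        return None
    correct = sum(i.final_label == i.gold for i in gold_items)
    return correct / len(gold_items)


def make_batch() -> List[Item]:
    """Two ordinary items plus one blind-test item whose pre-label is deliberately wrong."""
    return [
        Item("a1", "hello world"),
        Item("a2", "good morning"),
        Item("t1", "see you letter", gold="see you later"),
    ]


def rubber_stamp(item: Item) -> str:
    """An annotator who accepts every machine label without listening."""
    return item.pre_label


def careful(item: Item) -> str:
    """An annotator who corrects the error planted in the blind-test item."""
    corrections = {"t1": "see you later"}
    return corrections.get(item.audio_id, item.pre_label)


lazy_score = blind_test_accuracy(audit(make_batch(), rubber_stamp))
careful_score = blind_test_accuracy(audit(make_batch(), careful))
print(lazy_score, careful_score)
```

In this toy setup, an annotator who simply accepts machine labels fails the planted item, which is the kind of machine-induced bias the abstract says the quality controls are meant to catch. A real pipeline would likely compare transcripts with an error-rate metric rather than exact string equality.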

