The People's Speech is a free-to-download, 30,000-hour and growing supervised conversational English speech recognition dataset, licensed for academic and commercial use under CC-BY-SA (with a CC-BY subset). The data is collected by searching the Internet for appropriately licensed audio data with existing transcriptions. We describe our data collection methodology and release our data collection system under the Apache 2.0 license. We show that a model trained on this dataset achieves a 9.98% word error rate on Librispeech's test-clean test set. Finally, we discuss the legal and ethical issues surrounding the creation of a sizable machine learning corpus and our plans for continued maintenance of the project under MLCommons's sponsorship.
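For readers unfamiliar with the metric behind the 9.98% figure, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model's hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used for the reported result; the function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a six-word reference.
    rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {rate:.2%}")
```

Because insertions count as errors while the denominator is the reference length, this ratio can exceed 100% for very poor hypotheses.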