Current state-of-the-art acoustic models can easily comprise more than 100 million parameters. This growing complexity demands larger training datasets to maintain decent generalization of the final decision function. An ideal dataset is not necessarily large in size, but large with respect to the number of unique speakers, the hardware used, and varying recording conditions. This enables a machine learning model to explore as much of the domain-specific input space as possible during parameter estimation. This work introduces Common Phone, a gender-balanced, multilingual corpus recorded from more than 76,000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation. A Wav2Vec 2.0 acoustic model was trained on Common Phone to perform phonetic symbol recognition and to validate the quality of the generated phonetic annotation. The architecture achieved a phone error rate (PER) of 18.1% on the entire test set, computed across all 101 unique phonetic symbols, with slight differences between the individual languages. We conclude that Common Phone provides sufficient variability and reliable phonetic annotation to help bridge the gap between research and application of acoustic models.
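The PER reported above is presumably the standard phone error rate: the Levenshtein edit distance between the predicted and reference phone sequences, normalized by the number of reference phones. The sketch below illustrates that metric in plain Python; the function names are illustrative, not taken from the paper or any particular toolkit.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences
    (minimum number of substitutions, insertions, and deletions)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def phone_error_rate(refs, hyps):
    """PER = total edit distance over the corpus divided by the
    total number of reference phones, as a fraction (multiply by
    100 for a percentage such as the 18.1% reported above)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total
```

Because the distance operates on whole symbols rather than characters, multi-character IPA symbols (of which Common Phone's inventory has 101) are each treated as a single unit.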