Research on speaker recognition is extending to address vulnerability under in-the-wild conditions, among which genre mismatch is perhaps the most challenging: for instance, enrolling with read speech while testing with conversational or singing audio. This mismatch leads to complex and composite inter-session variations, both intrinsic (e.g., speaking style, physiological status) and extrinsic (e.g., recording device, background noise). Unfortunately, the few existing multi-genre corpora are not only limited in size but also recorded under controlled conditions, and so cannot support conclusive research on the multi-genre problem. In this work, we first publish CN-Celeb, a large-scale multi-genre corpus that includes in-the-wild speech utterances of 3,000 speakers in 11 different genres. Second, using this dataset, we conduct a comprehensive study of the multi-genre phenomenon, in particular the impact of the multi-genre challenge on speaker recognition and the performance gain obtained when the new dataset is used for multi-genre training.