In this paper, we contribute a novel and extensive dataset for speaker verification, comprising a noisy set of 38k identities / 1.45M utterances (VoxSnap) and a relatively clean set of 18k identities / 1.02M utterances (VoxSnap-Clean) for training. First, we collect a list of 60K+ users, together with their avatars, and download their short videos from YouTube. We then devise an automatic pipeline, both efficient and scalable, to extract each target user's speech segments and videos. To the best of our knowledge, VoxSnap is the largest speaker recognition dataset. Second, we conduct a series of experiments on VoxSnap-Clean combined with VoxCeleb2. Our findings show a notable performance improvement, ranging from 15% to 30%, across different backbone architectures when our dataset is added for training. The dataset will be released soon.
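The abstract does not detail the extraction pipeline. As a rough illustration of one plausible stage, the sketch below implements a simple energy-based voice-activity detector that locates candidate speech segments in a downloaded audio track; the function name, frame size, and threshold are illustrative assumptions, not the authors' actual method.

```python
import math

def extract_speech_segments(samples, sr, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) spans whose frame RMS energy exceeds threshold.

    Hypothetical helper: a stand-in for the paper's (unspecified) VAD stage.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        flags.append(rms > threshold)
    # Merge consecutive speech frames into contiguous segments.
    segments, start = [], None
    for i, is_speech in enumerate(flags):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_len / sr, i * frame_len / sr))
            start = None
    if start is not None:
        segments.append((start * frame_len / sr, n_frames * frame_len / sr))
    return segments

# Synthetic check: 1 s silence, 1 s of a 440 Hz tone, 1 s silence at 16 kHz.
sr = 16000
audio = ([0.0] * sr
         + [0.5 * math.sin(2 * math.pi * 440 * i / sr) for i in range(sr)]
         + [0.0] * sr)
print(extract_speech_segments(audio, sr))  # one segment near (1.0, 2.0)
```

A production pipeline would of course replace this heuristic with a trained VAD and add speaker-matching against the user's avatar or enrollment audio, but the segment-merging logic above captures the basic shape of turning raw downloads into per-user utterances.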