This work considers training neural networks for speaker recognition with a much smaller dataset than is used in contemporary work. We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset. These subsets are restricted to 50\,k audio files (versus the over 1\,M files available) and vary along the axes of number of speakers and session variability. We train three speaker recognition systems on these subsets: the X-vector, ECAPA-TDNN, and wav2vec2 network architectures. We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited. Code and data subsets are available at https://github.com/nikvaessen/w2v2-speaker-few-samples.
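The exact subset definitions are distributed in the repository above. As an illustration only, the following minimal Python sketch shows how a fixed-size subset could be drawn from a VoxCeleb2-style directory tree while trading off the number of speakers against files per speaker; the directory layout, function names, and paths are assumptions for this sketch, not the released tooling.

    # Minimal sketch, not the authors' released code (see the linked repository
    # for the actual subset definitions). Assumes the usual VoxCeleb2 layout
    # <root>/<speaker_id>/<session_id>/<utterance_id>.wav.
    import random
    from collections import defaultdict
    from pathlib import Path

    def sample_subset(wav_paths, num_speakers, num_files=50_000, seed=0):
        """Pick about `num_files` utterances spread over `num_speakers` speakers."""
        rng = random.Random(seed)
        by_speaker = defaultdict(list)
        for p in wav_paths:
            by_speaker[p.parts[-3]].append(p)  # speaker id is 3 levels up from the file

        speakers = rng.sample(sorted(by_speaker), num_speakers)
        per_speaker = num_files // num_speakers
        subset = []
        for spk in speakers:
            files = by_speaker[spk]
            subset.extend(rng.sample(files, min(per_speaker, len(files))))
        return subset

    # Hypothetical usage: few speakers with many files each, versus
    # many speakers with few files each, at the same 50k total.
    # all_wavs = list(Path("voxceleb2/dev/aac").rglob("*.wav"))
    # few_spk  = sample_subset(all_wavs, num_speakers=100)    # 500 files/speaker
    # many_spk = sample_subset(all_wavs, num_speakers=5_000)  # 10 files/speaker

Varying `num_speakers` at a fixed file budget is one simple way to realize the speaker-count axis the abstract describes; controlling session variability would additionally require grouping files by session id before sampling.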